Introduction


This  book  constitutes  the  proceedings  of  the  most  recent  (April  12- 
13,  1961)  of  a  continuing  series  of  symposia  on  Information  and  Decision 
Processes  held  at  Purdue  University,  where  recent  developments  in  this 
field are reported on by leading experts. The book Information and Decision
Processes, published under the editorship of one of us (R. E. M.) in
1960,  reports  the  preceding  symposium. 

As  originally  announced,  this  symposium  was  to  have  had  ten  speak- 
ers. The  papers  of  all  ten  are  included  here,  although  only  eight  of  them 
were actually presented at the conference. In addition, an extension of
the highly pertinent extempore remarks of Dean Tribus is included. The
two  papers  that  were  not  presented  were  those  of  Prof.  Robbins,  who  was 
not  able  to  be  present,  and  Prof.  Harry  H.  Goode  of  The  University  of 
Michigan,  who  died  suddenly  shortly  before  the  conference.  His  paper 
was  completed  by  his  colleagues,  as  explained  therein.  This  volume  is 
affectionately  dedicated  to  his  memory. 

Five  of  our  contributors  are  professional  mathematicians.  And  even 
those  who  are  not  —  three  engineers,  one  economist,  one  philosopher  (a 
logician),  and  one  professor  of  business  administration  —  are  mathemati- 
cians, too.  This  is  no  coincidence.  The  field  that  we  are  here  exploring, 
while  related  to  some  of  the  oldest  epistemological  questions,  is  yielding 
new  results  of  great  pragmatic  significance  precisely  because  it  has  been 
reformulated  in  mathematical  terms  and  examined  by  mathematical  tech- 
niques. 

The  basic  question  that  we  are  examining  is  how  to  make  an  optimum 
(or  near-optimum)  decision  in  the  presence  of  uncertainty.  The  three  key 
words, "optimum," "decision," and "uncertainty," deserve examination.

The  word  "optimum"  implies  that  an  objective  function  shall  be 
maximized  (or  minimized  —  there  is  no  significant  difference).  The  maxi- 
mization is  classically  the  mathematical  problem,  and  the  formulation  of 
the  objective  function  has  been  based  on  intuition,  or  made  the  jurisdic- 
tion of  a  "manager,"  or  perhaps  been  the  subject  of  operations  research. 
Without  deprecating  the  formal  mathematical  difficulties  involved  in  such 
problems  (some  of  which  appear  in  this  volume),  it  is  the  formulation  that 
is  most  difficult. 

The difficulties are illuminated by a story told by C. J. Hitch concerning
a fat colleague who applied modern decision-theory techniques
(including  a  digital  computer)  to  the  determination  of  an  optimum  diet  by 
which  he  might  lose  weight.  The  solution  which  the  computer  yielded  was 
to  drink  80  gallons  of  vinegar  a  day.  This  was  clearly  not  a  feasible 
solution  (in  the  nontechnical  sense  —  it  was  perfectly  feasible  in  terms 
of  the  rules  that  he  had  explicitly  formulated)  and  so  he  changed  the 
rules  and  tried  again.  The  difficulty  in  using  these  techniques  on  impor- 
tant problems  seems  to  be  that  in  many  such  problems  "our  intuition  does 
not  reach  very  powerfully,  and  it  therefore  is  not  so  easy  to  recognize 
vinegary  answers."  But  we  are  getting  ever  more  sophisticated  in  these 
techniques  and  in  the  methods  of  applying  them.  And  while  there  is  no 
immediate  hope  of  quantitatively  optimizing  our  national  budget  or  our 
over-all  military  posture,  the  region  of  nontrivial  matters  in  which  we  can 
apply  objective  and  quantitative  methods  is  steadily  increasing. 

One promising approach consists in developing new mathematical
models that permit formulation of the problem in a form that allows
the maximization to be performed more easily, and this is essentially
the subject of Bellman's article on dynamic programming. An advantage
of dynamic programming is that the resulting model can be solved more
readily, or with less approximation, than by classical methods, and it is
to be hoped that other such techniques can be evolved.

The  word  "decision"  implies  that  a  choice  shall  be  made  from  among 
a  set  of  feasible  actions.  This  constrains  the  discussion  to  situations  in 
which  some  activity  must  eventuate,  and  thus  avoids  the  type  of  meta- 
physical circularity  which  is  anathema  to  engineers  and  operationally 
oriented  mathematicians.  This  is  another  way  of  saying  that  each  of  our 
contributors  is  a  practical  man  and  that  this  volume  is  based  on  the 
search  for  utility. 

Several  of  our  contributors  (Dunham,  Goode,  Moriguti,  and  Wiener) 
are  concerned  with  the  application  of  advanced  computers  (which  appear 
by  now  to  be  synonymous  with  digital  computers)  to  the  making  of  these 
decisions.  One  of  them  (Wiener)  is  additionally  concerned  with  the  pos- 
sibility that  computer-made  decisions  may  be  dangerous.  The  analysis 
and  synthesis  of  computational  decision-making  procedures  is  central,  if 
not  essential,  to  the  problems  with  which  we  are  concerned,  for  it  is 
precisely  when  the  method  can  be  made  objective  and,  further,  algorithmic, 
that  the  particular  subproblem  in  hand  can  be  said  to  be  understood  and 
therefore  essentially  solved. 

The  word  "uncertainty"  is  usually  taken  to  imply  probabilities  that 
are  subject  to  objective  determination,  but  this  is  really  only  a  mathe- 
matical convenience.  Herman  Kahn  classifies  uncertainty  in  three  cate- 
gories: statistical  uncertainty,  "events  whose  probability  of  occurrence 
is more or less objective"; real uncertainty, "events to which individuals
may  attach  subjective  probability  but  to  which  there  is  essentially  no 
hope  at  present  of  getting  much  general  agreement  as  to  what  these  proba- 
bilities are";  and  "uncertainty  due  to  enemy  action."  These  categories 
are  clearly  not  mutually  exclusive.  Anyone  who  tries  to  build  a  random- 
number  generator  and  then  justify  its  randomness  will  probably  come  to 
the  conclusion  that  the  first  category  (statistical  uncertainty)  is  extremely 
rare,  if  it  exists  at  all.  There  is  a  gradual  transition  between  more  or 
less  deterministic  situations  and  those  in  which  the  outcomes  are  un- 
known or,  to  all  intents  and  purposes,  unknowable.  Thus,  statistical  me- 
chanics was  formulated  not  because  Gibbs  believed  that  the  state  was 
unknowable,  but  because  statistical  techniques  with  the  implied  hypothe- 
sis of  uncertainty  were  convenient. 

It  is  possible  to  prove  that  the  result  of  a  move  in  chess  is  "know- 
able,"  and  it  is  also  possible  to  prove  that  it  is  "unknowable."  The 
first  statement  follows  from  the  fact  that  chess  is  a  two-person  zero-sum 
game  with  perfect  information,  and  thus  that  each  side  has  at  any  point  at 
least  one  optimum  pure  strategy.  The  second  statement  (extending  an 
argument  due  to  George  W.  Brown)  follows  from  the  fact  that  to  use  this 
theorem  constructively  one  must  enumerate  the  available  strategies.  Now 
one  can  make  a  pretty  good  estimate  of  the  total  number  of  these  strate- 
gies, and  show  that  it  is  larger  than  the  largest  number  of  computations 
which  can  theoretically  be  made.  For  to  make  such  computations,  one 
must  define  a  computer;  and  no  computer  can  make  a  single  computation 
in  less  time  than  the  elementary  physical  unit  of  time  —  say,  the  duration 
of  the  passage  of  light  across  a  nucleus;  nor  can  a  computer  make  more 
simultaneous  computations  than  it  has  parts;  nor  can  it  have  more  parts 
than  there  are  electrons  in  the  universe;  nor  can  it  have  been  computing 
longer  than  since  the  beginning  of  the  universe.  While  this  supplies  a 
most  conservative  upper  bound,  it  is  not  a  very  large  number  —  its  natural 
logarithm  is  less  than  three  hundred  —  and  it  seems  clear  from  the  opera- 
tional viewpoint  (in  the  Bridgman  sense)  that  anything  which  requires 
more  computational  operations  than  this  must  be  theoretically  unknowable. 
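
The bound described above can be reproduced with rough arithmetic. The sketch below is only illustrative: the physical figures (about 10^80 electrons, an age of the universe of about 13.8 billion years, a nuclear diameter of about 10^-14 m) are modern order-of-magnitude assumptions and are not given in the text.

```python
import math

# Order-of-magnitude assumptions (not taken from the text).
ELECTRONS_IN_UNIVERSE = 1e80                      # upper limit on the number of parts
AGE_OF_UNIVERSE_S = 13.8e9 * 365.25 * 24 * 3600   # seconds available for computing
NUCLEUS_DIAMETER_M = 1e-14                        # distance light crosses in one step
SPEED_OF_LIGHT_M_S = 2.998e8


def log_max_operations() -> float:
    """Natural log of the largest number of computations the argument allows."""
    step_time = NUCLEUS_DIAMETER_M / SPEED_OF_LIGHT_M_S   # shortest single computation
    steps_per_part = AGE_OF_UNIVERSE_S / step_time        # sequential steps per part
    # At most one computation per part per step, so the total is parts * steps.
    return math.log(ELECTRONS_IN_UNIVERSE) + math.log(steps_per_part)


ln_bound = log_max_operations()
print(f"ln(max operations) is roughly {ln_bound:.1f}")
```

Under these assumed figures the logarithm stays below three hundred, in line with the estimate quoted in the text.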

It  may  be  that  there  are  some  physical  phenomena  (e.g.,  radioactive 
decay)  which  are  truly  stochastic,  but  in  the  great  bulk  of  situations 
which  we  examine  by  the  techniques  of  this  volume,  the  probabilistic 
interpretation  represents  the  only  way  we  know  for  describing  quantita- 
tively our  quasi-ignorance.  Of  course,  most  of  the  laws  of  probability 
(e.g.,  the  central-limit  theorem)  yield  acceptably  accurate  results.  Thus, 
for  example,  statistical  tests  applied  to  digital-computer  generation  of 
pseudorandom  numbers  show  that  the  assumptions  usually  made  in  the 
manipulations  of  such  numbers  are  justified  in  most  cases.  However,  in 
applying  our  models  to  real  situations,  we  must  be  particularly  cautious 
with such assumptions as that of independence. Doubt as to the usefulness
of probability in measuring real uncertainty led to the rejection of
Bayes'  theorem  as  the  basis  of  scientific  inference  during  the  early  part 
of  this  century.  These  very  considerations  are  related  to  the  vigorous 
renascence  of  the  Bayesians  as  typified  by  two  of  the  papers  (Raiffa, 
Savage)  in  this  volume. 

The scope and applicability of these decision-theory and
information-theory techniques are enormous. They extend beyond the obvious problems,
such as distinguishing between signal and noise, and beyond the organized
usage under such names as system engineering or operations research,
which have recently become such important disciplines. Every
industrial  or  commercial  manager,  every  politician,  every  military  man, 
and  even  most  engineers  spend  most  of  their  time  trying  to  obtain  more 
information  pertinent  to  the  problem  at  hand,  and  finally  making  a  decision 
on  the  information  which  is  available.  It  would  be  fatuous  to  assert  that 
they  use  quantitative  optimization  techniques  in  the  majority  of  cases,  or 
even  that  they  might  hope  to  do  so.  But  certainly  there  is  ample  room  in 
that  direction,  and  we  are  moving  monotonically  toward  the  quantification 
and  objectivization  of  our  decision  making. 

This  is  no  place  to  discuss  whether  such  dehumanization  is  "good" 
(Wiener  has  touched  on  this  question  here  and  elsewhere),  but  there  is 
another  aspect  which  must  appeal  to  the  philosopher.  For  the  decision- 
making process  is  one  of  the  ultimate  behavioral  modes  of  sentient 
beings,  and  in  explicating  this  process,  we  are  throwing  light  on  man 
himself. 

We  thus  assert  that  recent  developments  in  information  and  decision 
processes  are  matters  of  great  interest  and  significance,  and  are  proud  to 
present  the  following  group  of  important  contributions  to  this  field. 

Robert  E.  Machol 
Paul  Gray 
The Mathematics of
Self-Organizing  Systems 


I'm  going  to  discuss  two  problems  that  concern  self-organization:  the 
problem  of  making  a  machine  that  will  make  another  machine  in  its  own 
image,  and  the  problem  of  machines  that  learn.  Later  on  in  this  paper  I 
shall  show  that  these  two  problems  have  more  in  common  than  one 
might  believe. 


MACHINES THAT MAKE OTHER MACHINES IN THEIR IMAGE

There  are  two  points  of  view  on  such  machines  although  there  are  a 
good  many  more  than  two  people  at  work  on  them.  One,  stemming  from 
Von  Neumann,  is  essentially  combinatorial.  It  is  concerned  primarily 
with  showing  that  there  is  no  combinatorial  impossibility  in  a  machine 
making  others  in  its  own  image.  Here  the  image  is  conceived  as  a  quasi- 
static  pattern  which  represents  the  machine  and  can  possibly  make  other 
patterns  like  it.  In  this  approach,  the  fact  that  the  pattern  is  an  operative 
one,  performing  certain  functions,  does  not  come  into  the  discussion. 

This is not the point of view I'm taking here. We shall consider the
machine  as  an  operative  pattern  that  will  make  other  operative  machines 
in  its  own  image  by  the  proper  combination.  This  method  is  related  to 
work  done  by  Gabor  in  England  [1]  and  by  others. 

When  we  say  "make  other  machines  in  its  own  image,"  we  are  com- 
bining two  things:  the  problem  of  machine  analysis  and  the  problem  of 
machine  synthesis.  If  we  can  handle  the  analysis  and  the  synthesis 
separately,  there  is  no  great  difficulty  in  tying  them  together  into  one 
problem  and  using  the  results  in  the  construction  of  another  machine 
(with  the  same  analysis  as  the  original)  by  synthesis. 

To discuss these machines, I must first specify the sort of machines
I'm working with. Every machine can be conceived of as having a number
of inputs from the outer world which it combines to provide a number of
outputs to the outer world (Figure 1). This generalization is true whether
the machine is electrical, mechanical, or of any other sort. It is essentially
a coding problem, i.e., a coding of the inputs to obtain a desired
output, depending, in a definite way, on the inputs. The simplest case is
one input and one output depending on the past of the input, not necessarily
linearly (Figure 2). This one-input, one-output device I shall call
a transducer, in general a nonlinear transducer.

[Figure 1. Inputs -> General machine -> Outputs]

I  shall  now  show  how  the  problems  of  analysis  and  synthesis  of  such 
machines  can  be  solved  individually,  and  how  the  solutions  can  be  com- 
bined so  that  we  can  make  an  indeterminate  nonlinear  transducer.  By 
proper  setting,  such  a  transducer  could  be  any  nonlinear  transducer. 
I  shall  go  on  to  show  how,  by  tying  this  transducer  together  with  a  given 
linear  transducer,  a  black  box  will  become  a  transducer. 


[Figure 2. One input -> Nonlinear transducer on past inputs -> One output]

The  transducer  will  perhaps  not  be  a  picture  or  physical  replica  of  the 
black  box,  but  it  will  be  operatively  equivalent.  That  is,  although  they 
may  not  look  alike,  they  will  make  the  same  transformation  on  an  incoming 
message. 

THEORY  OF  NONLINEAR  TRANSDUCERS 


Suppose we have a function of time, $f(t)$, and an operator $F(f(t), \tau)$ acting
on it. This operator is a functional of the input and of the output time
($\tau$), which occurs after the input. I can get various operations. For example,
I might take the integral

$$\int_0^\infty k(t - \tau)\,\varphi(\tau)\,d\tau$$


This  integral  is  a  linear  operator  on  the  past.  I  might  take  a  nonlinear 
operator  such  as  the  following  quadratic,  homogeneous  operator: 

$$\int_0^\infty \int_0^\infty k_2(t - \tau_1,\, t - \tau_2)\,\varphi(\tau_1)\,\varphi(\tau_2)\,d\tau_1\,d\tau_2$$

This  class  of  operators  on  the  past  translates  when  I  translate  the 
functional  F. 
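
A discrete-time sketch may make the two operators and the translation property concrete. It is an illustration under stated assumptions, not the construction used in the paper: time is sampled at integer steps, the kernels `k` and `k2` are hypothetical short tables, and the sums are written in the causal form k[j] * phi[t - j], so that only the past of the input enters.

```python
from typing import Sequence


def linear_op(k: Sequence[float], phi: Sequence[float], t: int) -> float:
    """Discrete analogue of the linear operator: sum over j of k[j] * phi[t - j]."""
    return sum(k[j] * phi[t - j] for j in range(len(k)) if t - j >= 0)


def quadratic_op(k2: Sequence[Sequence[float]], phi: Sequence[float], t: int) -> float:
    """Discrete analogue of the homogeneous quadratic operator on the past."""
    n = len(k2)
    return sum(
        k2[i][j] * phi[t - i] * phi[t - j]
        for i in range(n)
        for j in range(n)
        if t - i >= 0 and t - j >= 0
    )


k = [1, 2]                 # hypothetical first-order kernel
k2 = [[1, 0], [3, 2]]      # hypothetical second-order kernel
phi = [0, 0, 1, 4, 2, 5]   # an input record
shifted = [0] + phi        # the same input delayed by one step

# Translation: delaying the input delays the output by the same amount.
assert linear_op(k, shifted, 5) == linear_op(k, phi, 4)
assert quadratic_op(k2, shifted, 5) == quadratic_op(k2, phi, 4)

# Homogeneity of degree two: doubling the input quadruples the quadratic output.
doubled = [2 * value for value in phi]
assert quadratic_op(k2, doubled, 4) == 4 * quadratic_op(k2, phi, 4)
```

Integer values are used so that the equalities hold exactly rather than up to rounding.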

Among  these  operators  in  the  nonlinear  transducer,  I  shall  consider 
only  those  that  are  stable.  In  terms  of  nonlinear  apparatus,  I  wish  to 
avoid  equipment  that  will  go  into  spontaneous  oscillation  and  not  be 
completely  determined  by  its  own  past. 

To  build  a  theory  of  the  linear  expansion  of  nonlinear  operators,  we 
must  have  a  set  of  standard  nonlinear  operators  that  we  can  combine  with 
variable and assignable coefficients. Therefore, the apparatus I shall
consider is the following (Figure 3). A number of specific boxes (linear
operators) are attached in parallel to a pair of busbars. A potentiometer
take-off from the output of each box can be read to give a positive or
negative multiple of the output of each box. We then take these outputs
and combine them in series. This procedure gives us a family of operations,
in general nonlinear, which are obtained by linear combinations of
a specific set of nonlinear operators.

[Figure 3. Linear operators attached in parallel to busbars.]

The  problem,  therefore,  is  to  find  a  set  of  boxes  that  provide  outputs 
that  form  a  significant  expansion.  When  we  combine  functions  to  produce 
a larger set by linear combinations with different coefficients, the most
convenient  case  we  know  is  that  of  orthogonal  functions. 

Can  I  obtain  boxes  which,  in  some  sense  or  other,  are  orthogonal  to 
one  another?  The  answer  is  yes.  Consider  the  two  boxes  shown  in  Figure 
4. If these boxes have a set of inputs with a certain metric to them, I
can multiply their outputs and consider the distribution of the product.
If for a certain metric of the inputs, the average of the product with
respect to the metric of the distribution is zero, I shall say that the
boxes are orthogonal.

[Figure 4. Two boxes (one labeled Black) feeding a multiplier.]

The objective can be achieved electrically. The apparatus for multiplying
the outputs is merely an electrical multiplier of potentials. The
desired multiplication in orthogonality depends on having an input of the
form $x(t, \alpha)$, where $x(t, \alpha)$ is the total amount of electricity passed and
has the statistical distribution of Brownian motion. Alpha is the parameter
of integration.

Thus we put $x(t, \alpha)$ into both the black box and the white box, multiply
their outputs, and average the output of the multiplier with respect to $\alpha$.
The white box will be said to be normal and orthogonal to the black box
if this average is zero.
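
A numerical sketch of this orthogonality test, under simplifying assumptions: the Brownian-motion input is replaced by independent unit Gaussian samples, and the two boxes are hypothetical memoryless devices (output `u` and output `u*u - 1`). Neither choice comes from the paper; they are picked only because the average of their product is known to vanish.

```python
import random


def black_box(u: float) -> float:
    """Hypothetical first box: passes the input through unchanged."""
    return u


def white_box(u: float) -> float:
    """Hypothetical second box: a centered square-law device."""
    return u * u - 1.0


def average_product(n_samples: int, seed: int = 0) -> float:
    """Feed the same random input to both boxes and average the multiplier output."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        u = rng.gauss(0.0, 1.0)          # one sample of the shared random input
        total += black_box(u) * white_box(u)
    return total / n_samples


avg = average_product(200_000)
print(f"average of the product: {avg:+.4f}")
```

The printed average should lie close to zero, which is the sense in which the two boxes are orthogonal for this input.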


IMPLEMENTATION  OF  NONLINEAR  TRANSDUCERS 


How can we obtain such a random $x(t, \alpha)$ input? The answer is that it can
be accomplished very easily by using the properties of shot-effect noise.
Electricity is not conducted continuously but discretely, by electrons.
The  stream  of  electrons  must  have  irregularities  which,  in  general,  have 
a  Poisson  distribution  in  time  for  constant  currents.  These  fluctuations 
can  be  amplified  and  the  shot  effect  put  to  practical  use  rather  than 
being  a  nuisance,  as  it  generally  is  in  electrical  circuits. 

All of the equipment I am discussing is readily available. This is
particularly important, because what I shall talk about is an actual
hardware piece of apparatus. Shot-effect generators can be bought in the
open market from a number of electrical firms. Multipliers of potential are
another kind of existing apparatus. Multiplying and squaring are the same
thing because

$$uv = \frac{(u + v)^2 - (u - v)^2}{4}$$
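
The identity follows from expanding both squares: the $u^2$ and $v^2$ terms cancel and $4uv$ remains. A minimal sketch of a multiplier built only from a squaring device, adders, and a fixed scale factor (the function names are illustrative, not from the paper):

```python
def square(x: float) -> float:
    """Stand-in for a square-law device."""
    return x * x


def multiply_by_squaring(u: float, v: float) -> float:
    """Form the product uv from two squarings, a sum, a difference, and a scale of 1/4."""
    return (square(u + v) - square(u - v)) / 4


print(multiply_by_squaring(3, 5))    # 15.0
print(multiply_by_squaring(-2, 7))   # -14.0
```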

Multiplying by a constant and adding can easily be done with electrical
potentials by applying Kirchhoff's laws and Ohm's law. Some of the multipliers
are simply made up of square-law rectifiers. Some use the Hall
effect. Others, made in Gabor's laboratory by a pupil of his, use the
piezoelectric effect.

We now come to a hard point. How do we get the coefficient, that is,
the integral of the product of the output over phase (over the $\alpha$ variable)?
If we take the Brownian motion $x(t, \alpha)$ and change it to $x(t + \tau, \alpha) - x(t, \alpha)$,
we obtain:

$$x(t + \tau, \alpha) - x(t, \alpha) = x(t, T^\tau \alpha)$$

where $T^\tau$ represents a family of measure-preserving transformations on
the line 0 to 1 on which $\alpha$ lies.

Measure-preserving  transformations  are  not  generally  ergodic  because 
they  usually  have  invariant  sets  which  have  a  measure  that  is  neither  1 
nor  0.  However,  it  can  be  proved  easily  and  strictly  that  this  particular 
transformation is ergodic. This means that if we take the product of the
output, in almost all cases and for almost all $\alpha$'s, the time average will
be the same as the $\alpha$ average. This time average can be measured by
various  integrating  devices. 
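
The practical content of this statement can be illustrated numerically. The sketch below is not a proof and rests on assumptions of its own: Brownian motion is sampled at unit time steps, so its increments are independent unit Gaussians, and the quantity being averaged is a hypothetical functional of two neighboring increments. It compares an average taken along one long record with an average taken over many independent records.

```python
import random


def functional(d_prev: float, d_now: float) -> float:
    """Hypothetical nonlinear functional of two neighboring increments."""
    return (d_prev + d_now) ** 2


def time_average(n_steps: int, seed: int) -> float:
    """Average the functional along a single long sample path."""
    rng = random.Random(seed)
    prev = rng.gauss(0.0, 1.0)
    total = 0.0
    for _ in range(n_steps):
        now = rng.gauss(0.0, 1.0)
        total += functional(prev, now)
        prev = now
    return total / n_steps


def ensemble_average(n_paths: int, seed: int) -> float:
    """Average the functional at one instant over many independent paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += functional(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    return total / n_paths


along_time = time_average(100_000, seed=1)
over_paths = ensemble_average(100_000, seed=2)
print(f"time average: {along_time:.3f}, ensemble average: {over_paths:.3f}")
```

Both estimates should approach the same value (here 2, the variance of the sum of two unit increments), which is what allows a single integrating device running in time to stand in for an average over $\alpha$.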

The ergodicity of the group $T^\tau$ of measure-preserving transformations
of the segment $0 < \alpha < 1$ into itself, defined by

$$x(t, T^\tau \alpha) = x(t + \tau, \alpha) - x(t, \alpha)$$

can be proved in the following way. We first notice that ergodicity is
equivalent to the statement that if $S$ is a measurable set of values of $\alpha$
which goes into itself under all transformations of this group, its measure
will be either 1 or 0. Moreover, in view of the way in which measure in $\alpha$
is defined, given any $\epsilon > 0$, there is a measurable set $S_\epsilon$ of values of $\alpha$,
dependent only on the differences of $x(t, \alpha)$ over the finite range

$$-A < t < A$$

and such that

$$m[(S_\epsilon \cap \bar{S}) \cup (S \cap \bar{S}_\epsilon)] < \epsilon$$

Under the transformation $T^\tau$, $S$ remains $S$; let $S_\epsilon$ become $T^\tau S_\epsilon$. Then,
because $T^\tau$ is measure invariant,

$$m[(T^\tau S_\epsilon \cap \bar{S}) \cup (S \cap \overline{T^\tau S_\epsilon})] < \epsilon$$

From this it can be proved that

$$m[((T^\tau S_\epsilon \cap S_\epsilon) \cap \bar{S}) \cup (S \cap \overline{T^\tau S_\epsilon \cap S_\epsilon})] < 2\epsilon$$

If $\tau > 2A$,

$$m(T^\tau S_\epsilon \cap S_\epsilon) = [m(S_\epsilon)]^2$$

since the range of values of $t$ from $-A$ to $A$ does not overlap the range
from $\tau - A$ to $\tau + A$, and the function $x(t, \alpha)$ has independent increments
for nonoverlapping ranges. Therefore,

$$|[m(S_\epsilon)]^2 - m(S)| < 2\epsilon$$

On the other hand,

$$|m(S_\epsilon) - m(S)| < \epsilon$$

from which

$$|[m(S_\epsilon)]^2 - [m(S)]^2| \le |m(S_\epsilon) + m(S)|\,|m(S_\epsilon) - m(S)| < 2\epsilon$$

Thus,

$$|m(S) - [m(S)]^2| < 4\epsilon$$

where $\epsilon$ is arbitrary. This implies

$$m(S) - [m(S)]^2 = 0$$

Thus, either

$$m(S) = 0 \quad \text{or} \quad m(S) = 1$$

This proves the ergodicity of the transformation group $T^\tau$.

Two  pieces  of  apparatus  will  be  orthogonal  in  their  output  from  the 
same  Brownian  motion  input  if  the  time  average  of  their  product  is  0. 
One  piece  of  apparatus  which  has  these  properties  is  shown  in  Figure  5. 

[Figure 5. Circuit diagram; labeled terminal: Input.]


The  circuit  shown  here  has  two  elements,  but  there  may  be  any  number  in 
succession.  These  devices  are  on  open  circuit.  The  potential  in  each 
will  result  from  the  potential  at  the  input  by  expanding  the  input  potential 
in  Laguerre  functions  of  the  past  to  obtain  the  coefficient  at  a  given  time. 
From  an  engineering  point  of  view,  a  piece  of  apparatus  on  open  circuit 
is  one  that  is  feeding  into  a  very  large  resistance.  This  large  resistance 
is  what  one  finds  when  one  takes  a  reading  off  the  apparatus  through  a 
cathode  follower.  Therefore,  we  can  get  a  number  of  outputs  in  the  past 
that  are  coefficients  of  the  various  Laguerre  polynomials  in  the  Laguerre 
polynomial  expansion  of  the  input.  We  can  go  one  step  further.  We  can 
obtain  Hermite  polynomials  (on  the  proper  scale)  in  the  coefficients  of 
the  Laguerre  developments  of  the  past.  It  can  be  shown  quite  easily 
that,  for  a  random  input,  these  are  normal  and  orthogonal. 
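
The one-variable core of this claim can be checked numerically. The sketch below assumes that, on the proper scale, a single Laguerre coefficient of the random input behaves as a unit Gaussian variable; it then verifies that Hermite polynomials of that variable, divided by the square root of n factorial, are normal and orthogonal under Gaussian averaging. The probabilists' convention for the polynomials and the trapezoidal integration are choices made for this illustration, not details taken from the paper.

```python
import math


def hermite(n: int, u: float) -> float:
    """Probabilists' Hermite polynomial He_n(u) via He_{k+1} = u*He_k - k*He_{k-1}."""
    h_prev, h = 1.0, u
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, u * h - k * h_prev
    return h


def normalized_hermite(n: int, u: float) -> float:
    """He_n scaled so that its mean square under a unit Gaussian is 1."""
    return hermite(n, u) / math.sqrt(math.factorial(n))


def gaussian_average(f, lo: float = -12.0, hi: float = 12.0, steps: int = 4800) -> float:
    """Trapezoidal approximation of the average of f(u) for a unit Gaussian u."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * width
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * f(u) * math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return total * width


def inner(m: int, n: int) -> float:
    """Average of the product of two normalized Hermite polynomials."""
    return gaussian_average(lambda u: normalized_hermite(m, u) * normalized_hermite(n, u))


for m in range(4):
    print([round(inner(m, n), 6) for n in range(4)])
```

The printed table should be, up to rounding, the identity matrix: ones on the diagonal (normal) and zeros elsewhere (orthogonal).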


WHITE BOX - BLACK BOX EQUIVALENCE

Thus,  with  existing  apparatus,  we  can  put  the  same  random  input  into 
the  black  box  and  into  the  white  box.  We  can  then  multiply  their  outputs, 
take  the  time  average  of  the  product,  and  obtain  the  coefficients  that  we 
would have to put into the potentiometer connected to the white box. We
then can produce a system that will be the equivalent of the black box
for every random input (Figure 6). Since a random input will (for any
interval of time) simulate, sooner or later, arbitrarily closely any input,
the equivalence is valid for a much wider class of inputs. Thus, although
we use the random input for the analysis and for the synthesis of a
machine that makes another in its own image, the equivalence we obtain
is valid for much more than random inputs.

[Figure 6.]

As  already  mentioned,  we  can  get  a  reading  for  each  white  box  of  the 
coefficient  it  will  have  to  have  so  that  it  is  the  proper  term  for  a  black 
box.  Without  great  difficulty,  we  can  make  an  apparatus  of  this  sort  that 
will  set  the  potentiometer  to  exactly  the  value  we  desire. 
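
The whole procedure (same random input to both boxes, multiply, average, set the potentiometers) can be sketched in a few lines. This is a toy version under heavy assumptions: the black box is a hypothetical memoryless polynomial, the input is a stream of independent unit Gaussian samples rather than shot-effect noise passed through a Laguerre network, and the white boxes are the first three normalized Hermite polynomials.

```python
import math
import random


def black_box(u: float) -> float:
    """Hypothetical unknown device that the white box must imitate."""
    return 1.0 + 2.0 * u + 0.5 * u * u


# Orthonormal white boxes (normalized Hermite polynomials of degree 0, 1, 2).
WHITE_BOXES = [
    lambda u: 1.0,
    lambda u: u,
    lambda u: (u * u - 1.0) / math.sqrt(2.0),
]


def fit_coefficients(n_samples: int, seed: int = 0) -> list:
    """Potentiometer settings: average of (black box output) * (white box output)."""
    rng = random.Random(seed)
    sums = [0.0] * len(WHITE_BOXES)
    for _ in range(n_samples):
        u = rng.gauss(0.0, 1.0)      # the same random input goes to every box
        y = black_box(u)
        for i, box in enumerate(WHITE_BOXES):
            sums[i] += y * box(u)
    return [s / n_samples for s in sums]


def white_box_output(coeffs: list, u: float) -> float:
    """Synthesized device: linear combination of the standard boxes."""
    return sum(c * box(u) for c, box in zip(coeffs, WHITE_BOXES))


coeffs = fit_coefficients(100_000)
print("fitted coefficients:", [round(c, 3) for c in coeffs])
print("black box at 0.7:", black_box(0.7))
print("white box at 0.7:", round(white_box_output(coeffs, 0.7), 3))
```

Although the coefficients are fitted with random input only, the synthesized device can then be evaluated on any particular input, such as 0.7 here, which mirrors the remark that the equivalence holds for much more than random inputs.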

What  have  we  done  by  this?  We  have  made  an  apparatus  that  will  allow 
an  indeterminate  system  to  become,  to  as  many  terms  as  we  want  (if  we 
give  it  enough  time  for  averaging),  the  equivalent  of  the  black  box.  In 
other  words,  we  have  made  an  indeterminate  white  box  that  might  have 
become  any  black  box.  By  proper  connection  to  the  black  box,  however, 
we  have  turned  it  into  the  operative  image  of  the  black  box. 

There  are  other  ways  of  achieving  equivalence  besides  orthogonal 
expansion.  We  can,  for  example,  use  nonorthogonal  or  biorthogonal  ex- 
pansions and  similar  tricks.  Gabor  and  other  people  have  studied  this 
sort  of  learning  of  one  apparatus  to  be  another  in  various  ways.  In  all  of 
them,  the  ideas  of  the  random  input,  of  linear  combination  (although  this 
is  not  necessarily  so  important  here),  of  multiplying  potentials,  and  of 
time averages are all involved. Thus, we have a technique with many
varieties.


ANALOGY  TO  LIFE  PROCESSES 

What  we  have  done  is  very  interesting.  It  is  analogous  —  I  am  not 
saying  it  is  the  same  process  in  any  detail  —  to  what  happens  in  biology 
when  a  virus  or  a  gene,  a  molecule,  is  put  into  an  indeterminate  nutritive 
medium.  The  medium  might  have  nourished  other  viruses  or  genes  or 
antibodies  for  that  matter,  but  it  will  make  more  molecules  in  the  image 
of  the  ones  we  put  in.  This  process  has  a  logical  similarity  —  again,  I 
don't  say  a  similarity  in  detail  —  to  the  fundamental  biological  process 
of  life. 

If we compare the results I've presented with what happens in life, we
get  some  interesting  things.  The  present  idea  of  how  genes  and  viruses 
multiply  involves  a  double-spiral  theory  of  the  nucleic  acids  as  the 
model  of  multiplication.  Essentially  it  accounts  for  a  short-distance 
multiplication  and  gives  a  good  account  of  the  statics  of  multiplication. 
However, to really account for life functions, it needs to be amplified by
dynamics. What we have here may be the basis for such a dynamic theory.

There  are  other  elements  that  are  also  of  very  great  interest.  This  is  a 
heredity  machine.  Heredity  is  the  basis  of  natural  selection  in  animals 
or  plants,  which  may  be  regarded  as  a  sort  of  racial  learning.  The  ex- 
istence of  heredity  is  necessary  for  the  variations  to  be  carried  on,  and 
it  is  necessary  for  racial  learning  processes  to  exist. 

Does  this  sort  of  thing  have  anything  to  do  with  individual  learning? 
It  has  a  great  deal  to  do  with  it. 


LEARNING  MACHINES 

To  illustrate  this  relation  to  individual  learning,  I  shall  discuss  briefly 
the  learning  machines  that  have  been  built  by  IBM  and  by  others,  specifi- 
cally the  machine  built  by  Samuel  to  play  checkers  [2]. 

The  crudest  sort  of  machine  for  playing  checkers  would,  at  every  stage 
of  the  game,  select  the  moves  that  are  allowable  and  would  weight  each 
move  in  terms  of  the  number  of  pieces  the  machine  would  have  at  the  end 
of  the  move,  the  number  of  pieces  the  opponent  would  have,  mobility, 
command,  and  so  on.  It  would  attempt  to  give  this  weighting  on  some 
basis  that  has  proved  humanly  reasonable  for  playing  checkers.  Such  a 
machine  will  play  checkers,  but  it  won't  play  very  good  checkers.  In 
particular,  once  one  gets  the  hang  of  it  as  an  opponent,  its  weaknesses 
will  always  be  weaknesses.  So,  wherever  it  is  weak,  sooner  or  later  one 
can  get  the  upper  hand  on  it. 

Such  a  machine  would  still  seem,  if  one  played,  say,  a  correspondence 
game  with  it,  to  have  a  game  personality,  but  a  very  rigid  one.  Let  us 
take  the  next  stage.  We  have  a  machine  that  plays  this  way,  but  after 
each  series  of  moves  it  goes  back  into  its  old  record  of  plays  and  op- 
ponent's plays  that  is  stored  in  its  memory.  It  does  not  ask  the  question: 
How  should  I  carry  out  this  policy?  Rather  it  asks:  What  weighting  of 
these  considerations  (pieces  on  one  side,  pieces  on  the  other,  etc.) 
would  have  corresponded  most  to  won  games  and  least  to  lost  games  if 
this  evaluation  had  been  made  on  the  games  played?  The  machine  then 
finds  a  new  evaluation.  It  continues  to  play  with  this  new  evaluation 
and,  from  time  to  time,  it  again  submits  its  evaluation  to  criticism  with 
respect  to  success  in  the  games  played. 
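
The two stages described above, a fixed weighted evaluation and a periodic re-weighting from the record of past games, can be sketched as follows. This is a deliberately simple stand-in and not Samuel's actual procedure: the feature names, the game records, and the re-weighting rule (each weight is the average of outcome times feature value) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# One record: (feature values such as piece advantage and mobility, +1 win / -1 loss).
GameRecord = Tuple[Sequence[float], int]


def evaluate(weights: Sequence[float], features: Sequence[float]) -> float:
    """Score a position as a weighted sum of its features."""
    return sum(w * f for w, f in zip(weights, features))


def choose_move(weights: Sequence[float], candidates: List[Sequence[float]]) -> Sequence[float]:
    """Pick the candidate position (given by its features) with the highest score."""
    return max(candidates, key=lambda feats: evaluate(weights, feats))


def reweight(records: List[GameRecord]) -> List[float]:
    """New weights: features that went with wins gain weight, those with losses lose it."""
    n_features = len(records[0][0])
    return [
        sum(outcome * feats[i] for feats, outcome in records) / len(records)
        for i in range(n_features)
    ]


# Hypothetical stored record of four games: (piece advantage, mobility), outcome.
records = [((2, 1), +1), ((1, -1), +1), ((-1, 2), -1), ((-2, 0), -1)]

weights = reweight(records)
print("new weights:", weights)
print("chosen move:", choose_move(weights, [(1, 3), (2, 0)]))
```

With these records the piece-advantage weight comes out positive and the mobility weight negative, so the machine would now prefer the candidate with the larger piece advantage. Repeating `reweight` as new games accumulate is the step that lets the evaluation change with experience.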

In this way, a trick that worked with the machine at the beginning will
fail  to  work  because  the  machine  will  get  on  to  it.  It  will  learn  that  that 
particular  weighting  no  longer  corresponds  to  won  games  because  the 
opponent  can  counter  it.  IBM  has  built  such  a  machine  which,  with  rela- 
tively short  programming,  has  been  able  to  defeat  consistently  the  man 
who  programmed  it. 

RELATION  OF  LEARNING  MACHINE  AND  GENETIC  MACHINE 

The  checker-playing  machine  is  a  learning  machine  in  every  sense. 
These  machines,  then,  do  exist.  Have  they  any  wider  scope? 

The  learning  machine  and  the  genetic  machine  discussed  previously 
have  more  in  common  than  meets  the  eye.  In  the  genetic  machine,  the 
existing  machine  (the  black  box)  changes  the  machine  opposed  to  it  (the 
white  box)  so  that  the  white  box  will  be  like  itself.  In  the  checker-playing 
machine,  the  machine  is  the  white  box  and  the  human  opponent  the  black 
box.  The  objective  here  is  not  for  the  human  to  change  the  machine  to  be 
like  himself,  but  for  the  white  box  to  be  able  to  defeat  the  black  box. 
There  is  a  coupling  of  two  machines,  one  of  which  happens  to  be  the 
human  being,  which  changes  one  of  the  machines  to  accomplish  some 
purpose. 

This  fundamental  idea  of  the  modification  of  one  machine  by  another 
through  coupling  exists  in  both  learning  and  genetic  machines.  Note  that 
as  an  incidental  result,  the  white  box  will  learn  some  of  the  policies  of 
the  black  box  and  will,  to  a  limited  extent,  become  more  like  it. 

CHESS-PLAYING  MACHINES 

The  programming  of  chess-playing  machines  will  have  to  be  much 
finer  than  that  of  checker-playing  machines  in  order  to  consider  what  is 
advantageous  and  what  is  disadvantageous.  There  will  have  to  be  much 
more consideration of the differences between different stages of the
game.  Special  policies  will  be  needed  for  beginning  and  ending  games. 
Some  of  my  chess-playing  friends  claim  that  the  next  10  to  25  years  will 
probably  see  chess-playing  machines  playing  master  chess,  which  they 
are  far  from  doing  now.  Today  they  can  play  a  rather  poor  amateur  chess 
at  best. 

These  machines  that  learn  to  oppose  a  player  provide,  in  my  opinion, 
a  much  more  realistic  theory  of  games  than  the  Von  Neumann  theory  of 
the  minimax  game.  The  Von  Neumann  theory,  in  its  standard  form,  does 
not  allow  us  to  make  use  of  the  sheer  stupidity  of  the  opponent.  That  is, 
we  have  to  assume  the  opponent  will  play  the  best  possible  game  against 
the  best  possible  game  we  can  play.  This  may  be  a  definitely  bad  policy 
for  handling  an  opponent  who  is  not  a  very  good  player. 
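
A small numerical sketch may make the contrast concrete. The payoff table and the record of the opponent's past choices below are hypothetical, and the sketch considers pure strategies only; it is meant to show the difference between a security-level choice and a choice fitted to an observed opponent, not to reproduce any particular game.

```python
# Payoffs to us: rows are our policies, columns are the opponent's replies.
# Column 0 is a blunder by the opponent, column 1 is his best reply.
payoff = [
    [1, 1],    # cautious policy: a small gain whatever the opponent does
    [5, -3],   # bold policy: large gain against a blunder, loss otherwise
]

def maximin_row(payoff):
    """Policy with the best guaranteed payoff against a perfect opponent."""
    return max(range(len(payoff)), key=lambda r: min(payoff[r]))

def observed_frequencies(history, n_columns):
    """Relative frequency of each opponent reply in the recorded games."""
    return [history.count(c) / len(history) for c in range(n_columns)]

def best_response(payoff, frequencies):
    """Policy with the best average payoff against the observed opponent."""
    def expected(r):
        return sum(p * f for p, f in zip(payoff[r], frequencies))
    return max(range(len(payoff)), key=expected)

# Hypothetical record: this opponent blundered in 6 games out of 8.
history = [0, 0, 1, 0, 0, 0, 1, 0]
frequencies = observed_frequencies(history, n_columns=2)

safe_choice = maximin_row(payoff)
learned_choice = best_response(payoff, frequencies)
print("minimax choice:", safe_choice, "learned choice:", learned_choice)
```

In this table the minimax choice is the cautious policy, while the choice fitted to the record is the bold one; against an opponent who never blunders, the two coincide.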

TRANSLATING MACHINES

There  are  games  in  which  relations  other  than  winning  or  imitation 
enter.  Let  us  consider  machines  that  translate  one  language  into  another, 
and  try  to  program  such  a  machine  to  be  a  learning  machine.  One  of  the 
things  that  is  most  difficult  in  designing  a  translation  machine  is  to 
build  a  critique  into  it.  In  a  game-playing  machine,  the  critique  is  straight- 
forward. The  machine  is  good  if  it  wins.  It  is  bad  if  it  loses.  In  a  copying 
machine,  the  machine  is  good  if  it  is  like  its  model  in  behavior,  and  bad 
if  it  is  unlike  it.  What  can  be  said  about  a  translating  machine? 

The  main  difficulty  with  a  translating  machine  is  that  there  is  no 
abstract  criterion  of  a  good  translation.  We  can  do  two  things  about  this. 
One  is  to  go  very  elaborately  into  a  theory  of  language  and  try  to  build 
up  a  theory  of  a  good  translation.  This  is  a  long  way  around,  however, 
and  I  don't  believe  it  is  the  right  way.  The  other  possibility  is  to  make 
the  machine  learn. 

Let  us  take  a  translation  machine  and,  to  eliminate  some  of  the  prob- 
lems, let  it  be  a  double-translation  machine.  For  example,  it  may  trans- 
late from  English  into  Italian  and  from  Italian  back  into  English  by  inde- 
pendent operations.  The  criterion  of  combined  goodness  is:  When  the 
double  process  is  completed,  does  the  machine  provide  a  humanly  ac- 
ceptable equivalent  of  the  original?  This  could  also  be  done  in  one 
stage,  but  it  indicates  what  I  mean.  We  then  use  a  human  being,  who  has 
the  values  of  good  translation  and  bad  translation,  to  give  the  figure  of 
merit  to  our  translation  effort.  After  all,  these  translations  are  being 
done  for  human  purposes. 

To  teach  the  machine,  we  would  put  in  a  random  series  of  messages, 
just  as  we  put  in  a  random  series  of  inputs  to  the  black  box  and  just  as 
we  played  a  random  series  of  games.  These  are  exercises,  like  a  school- 
boy's. Then  the  teacher,  the  human  being,  marks  these  exercises  and  the 
machine  is  reprogrammed  according  to  what  would  have  been  a  good  set 
of  translation  rules  for  acceptance  by  a  human  being. 

We  have  an  inverse  probability  problem  here,  very  much  of  a  Bayes 
problem.  We  are  not  asking  what  is  the  distribution  of  outputs  when  we 
have  a  certain  input,  but  what  is  the  distribution  of  rules  that  would  best 
correspond  to  good  performance. 
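
As a rough sketch of this inverse problem, suppose there is a small finite set of candidate rule sets and the teacher has marked the exercises each would have produced. Everything specific here is an assumption for illustration: the rule-set names, the marks, the uniform prior, and the simple model that a good rule set has each output accepted with probability q.

```python
def posterior_over_rules(marks, prior=None, q=0.9):
    """Bayes' rule over candidate rule sets given the teacher's marks.

    marks: dict mapping a rule-set name to a list of booleans, True where
    the teacher accepted the translation that rule set produced.
    q: assumed chance that a good rule set's output is accepted.
    """
    names = list(marks)
    if prior is None:
        prior = {name: 1.0 / len(names) for name in names}
    unnormalized = {}
    for name in names:
        likelihood = 1.0
        for accepted in marks[name]:
            likelihood *= q if accepted else (1.0 - q)
        unnormalized[name] = prior[name] * likelihood
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}

# Hypothetical marks on four exercises for three candidate rule sets.
marks = {
    "literal": [True, False, False, True],
    "idiomatic": [True, True, True, False],
    "word_by_word": [False, False, True, False],
}

posterior = posterior_over_rules(marks)
best_rules = max(posterior, key=posterior.get)
print(best_rules, posterior)
```

The output is a distribution over rules rather than over translations, which is the sense in which the problem is an inverse one.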

Having  put  in  a  critique  of  performance,  having  put  in  something  that 
might  be  called  Pavlovian  pleasure  and  pain  that  makes  the  machine 
continue  when  its  performance  is  satisfactory  and  change  when  its  per- 
formance is  not,  we  get  a  learning.  This  learning  is  not  the  same  in  de- 
tail, but  certainly  the  same  in  principle,  as  the  learning  of  a  child  study- 
ing a  new  language  under  a  teacher  and  having  his  exercises  corrected. 

In such a way, we can transmit values from one machine to another
and,  in  particular,  from  a  human  machine  to  a  mechanical  one.  This  can 
also  be  done  indirectly.  A  well-trained  machine  can  become  a  teacher  to 
another  machine  or  to  a  person.  We  have  here,  therefore,  the  basis  for 
learning  machines.  In  this  series  of  learning  machines,  the  learning 
machines  for  games  and  the  learning  machines  for  copying  occupy  special 
places.  This  is  one  way  of  building  up  self-organizing  systems. 

SELF-ORGANIZATION  BY  OSCILLATION 

I  now  wish  to  discuss  another  form  of  self-organization  that  occurs, 
both  in  living  organisms  and  in  machines.  This  is  self-organization  by 
oscillation  or  pulling  together  of  frequencies.  As  an  example,  I  shall 
discuss  the  self-organization  of  electrical  power-generating  systems. 

Essentially,  an  ac  electrical  generating  system  consists  of  a  number 
of  alternators,  driven  by  prime  movers,  where  the  speed  of  the  prime 
mover  is  regulated  by  a  governor.  The  governor  keeps  the  prime  mover  to 
speed and, by this speed, determines the frequency of the output. However,
nowadays, generating systems are much more complicated. We have
a number of similar generators, in many cases a large number, taking the
load at the outside (Figure 7). We also have apparatus to prevent the
system from going haywire. The generators are not tied to the busbars
until they are at nearly the same frequency as one another; in any case,
where the load flow begins to throw one generator too far out, it is thrown
off the busbar. These operations are, incidentally, highly nonlinear.

[Figure 7. Alternators.]

Even  when  they  are  on  the  same  busbar,  the  generators  are  no  longer 
independent.  If  one  picks  up  a  minute  increment  of  speed,  that  is,  just 
begins  to  go  fast,  this  type  of  connection  will,  itself,  give  a  motor  action 
on  the  generator  in  the  opposite  direction  to  its  motion.  We  increase  its 
load,  and  we  slow  it  down.  The  generators  going  too  slowly  will  have  a 
back  action  in  the  other  direction  and  will  be  speeded  up.  Thus,  there  is 
a  tendency  for  the  frequencies  of  the  coupled  generators  to  pull  one 
another  together,  the  fast  ones  being  slowed  down  and  the  slow  ones 
being  speeded  up.  This  combination  gives  us  an  effective  governor  for 
the  entire  system  and  generally  is  much  more  accurate  than  the  individual 
governors.  It  permits  a  highly  organized  precision  of  frequencies  that 
can  be  used  in  electric  clocks  and  similar  devices. 
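
The pulling together of frequencies can be illustrated numerically. The sketch below does not model real alternators; it assumes a simple phase-coupling law, in which each unit is pulled toward the others in proportion to the sine of the phase difference, and the natural frequencies and coupling strength are hypothetical.

```python
import math

def effective_frequencies(natural_hz, coupling, t_end=10.0, dt=0.001):
    """Average frequency of each unit over the second half of the run.

    Each phase obeys d(theta_i)/dt = omega_i + (coupling / n) *
    sum_j sin(theta_j - theta_i), integrated by Euler steps. The coupling
    law is an assumption of this sketch, not a model of a power system.
    """
    n = len(natural_hz)
    omega = [2.0 * math.pi * f for f in natural_hz]
    theta = [0.0] * n
    steps = int(round(t_end / dt))
    half = steps // 2
    theta_half = theta[:]
    for step in range(steps):
        if step == half:
            theta_half = theta[:]
        rates = []
        for i in range(n):
            pull = sum(math.sin(theta[j] - theta[i]) for j in range(n))
            rates.append(omega[i] + coupling * pull / n)
        theta = [th + dt * r for th, r in zip(theta, rates)]
    window = (steps - half) * dt
    return [(a - b) / (2.0 * math.pi * window) for a, b in zip(theta, theta_half)]

def spread(values):
    return max(values) - min(values)

# Hypothetical governor settings, slightly different for each unit (Hz).
natural = [59.8, 59.9, 60.0, 60.1, 60.2]

free = effective_frequencies(natural, coupling=0.0)
tied = effective_frequencies(natural, coupling=5.0)
print("spread uncoupled (Hz):", spread(free))
print("spread coupled (Hz):  ", spread(tied))
```

With no coupling the units keep their separate frequencies; with sufficient coupling the fast ones are slowed, the slow ones are speeded up, and all settle on a common frequency.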

EXAMPLES FROM BIOLOGY

The  same  thing  occurs  in  biology.  For  example,  some  of  the  fireflies 
of  New  Guinea  and  the  Orient  tend  to  flash  in  unison  in  the  trees,  so 
much  so  that  one  can  time  the  flashing.  Why?  The  firefly  not  only  emits 
light  periodically,  or  nearly  so,  but  it  sees  the  light  of  the  other  fireflies. 
This light tends to pull each firefly into unison with the flashes that
it sees. In such a system we have a pulling together of frequencies
that  will  tend  to  emphasize  greatly  certain  individual  frequencies. 

The  importance  of  this  self-organizing  principle  has  also  been  recog- 
nized in  the  organization  of  the  shape  of  snowflakes  [3].  Snowflakes,  of 
course,  are  crystalline  structures  with  periodic  lattices.  But  this  alone 
is  not  sufficient  to  explain  why  the  different  limbs  of  the  snowflake 
develop  in  exactly  the  same  way,  or  nearly  the  same  way,  in  the  same 
snowflake.  The  suggestion  is  made  that  elastic  resonance  exists  between 
the  different  limbs,  which  tends  to  make  the  smaller  limbs  assume  the 
same  frequency  as  the  others  by  growing  and  to  make  the  larger  limbs  as- 
sume the  same  frequency  by  more  evaporation.  This  resonance,  then, 
tends  to  pull  the  limbs  of  the  snowflake  together  in  much  the  same  way 
that  the  electrical  generating  system  pulls  alternators  together.  There  is 
considerable  evidence  that  this  is  the  case,  but  it  needs  to  be  gone  over 
in  a  much  more  precise  mathematical  way  than  has  been  done  thus  far. 

Human  alpha  rhythm  also  exhibits  this  phenomenon  of  pulling  together 
of frequencies. Typical electroencephalogram (EEG) spectra (Figure 8)
are obtained by computing the autocorrelation of the EEG output and
taking  the  Fourier  transform  of  the  autocorrelation.  The  spectrum  has  a 
narrow  peak  very  sharply  differentiated  from  the  rest  of  the  frequencies 
present  with  a  more  or  less  characteristic  shape.  This  peak  is  stable  for 
hours  with  a  width  of  about  a  third  of  a  cycle.  This  alpha  rhythm,  which 
is not markedly present in all people but which is very frequent, can be
driven  by  a  flicker  into  the  eye  at  about  its  own  frequency  (a  very  un- 
pleasant experience).  A  system  that  can  be  driven  by  a  frequency  near  to 
its own is a nonlinear system, for in linear systems different
frequencies cannot act upon one another. My suggestion is that the
narrowness  of  this  peak  was  caused  by  pulling  together  of  frequencies. 
There  is  further  evidence  that  this  hypothesis  is  correct.  The  question 
is,  what  is  the  original  oscillation  of  the  EEG  like?  It  is  well  known 
that  if  we  put  a  sharp  shock  into  the  brain  at  some  point,  the  current 
returns to zero, in the fashion of a heavily damped oscillation (Figure 9).
This is called the after-discharge. If we put a large number of random
impulses  with  Poisson  distribution  into  the  brain,  we  can  make  a  har- 
monic analysis  of  the  response  by  autocorrelation  methods,  just  as  we 
did  for  the  brain  waves.  Barlow  of  Massachusetts  General  Hospital  has 
done  precisely  this  [4].  He  has  studied  different  individuals  and  found 
both  what  seems  to  be  the  central  frequency  of  the  alpha  rhythm  and  this 
randomly evoked after-discharge. Although this frequency is not the
same  for  all  people,  he  has  found  a  very  high  positive  correlation  be- 
tween the  position  of  the  center  frequency  for  the  after-discharge  and 
for  the  alpha  rhythm.  However,  for  the  same  person,  the  after-discharge 
frequency  spectrum  is  much  broader  than  the  alpha-rhythm  spectrum 
(Figure  10).  This  is  definite  evidence  that  there  is  pulling  together  of 
frequencies  in  the  self-organization  of  the  alpha  rhythm. 
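
The harmonic analysis by autocorrelation mentioned above can be sketched in a few lines. The record analyzed here is not EEG data: it is a hypothetical test signal, a 10-cycle-per-second sine wave with added noise, and the sampling rate, number of lags, and frequency grid are arbitrary choices for the illustration.

```python
import math
import random

def autocorrelation(x, max_lag):
    """Autocorrelation estimate for lags 0 through max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]

def spectrum_from_autocorrelation(r, freqs_hz, sample_rate):
    """Fourier (cosine) transform of an even autocorrelation sequence."""
    power = []
    for f in freqs_hz:
        total = r[0] + 2.0 * sum(
            r[k] * math.cos(2.0 * math.pi * f * k / sample_rate)
            for k in range(1, len(r)))
        power.append(total)
    return power

# Hypothetical record: a 10 Hz rhythm buried in noise, sampled at 100 Hz.
rng = random.Random(0)
sample_rate = 100.0
signal = [math.sin(2.0 * math.pi * 10.0 * n / sample_rate)
          + 0.5 * rng.uniform(-1.0, 1.0) for n in range(1000)]

r = autocorrelation(signal, max_lag=200)
freqs = [0.5 * k for k in range(101)]   # 0 to 50 Hz in steps of 0.5 Hz
power = spectrum_from_autocorrelation(r, freqs, sample_rate)
peak_hz = freqs[power.index(max(power))]
print("spectral peak near", peak_hz, "Hz")
```

The width of the peak that such an analysis can resolve is set by the number of lags kept, which is why long records are needed to exhibit a peak as narrow as a third of a cycle.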


Thus,  as  we  have  seen,  there  are  suggestions  that  not  only  in  crystals 
but  in  living  organisms,  the  symmetry,  the  preservation  of  form,  the 
organizer  may  be  found  by  pulling  together  of  oscillations  in  frequency. 
It  is  a  most  interesting  and  powerful  concept. 

APPENDIX I: SOCIAL IMPLICATIONS OF LEARNING MACHINES¹

There  are  some  very  important  things  to  be  said  about  the  social 
consequences  of  learning  machines.  A  learning  machine  is  not  completely 
programmed  when  it  is  built;  much  of  its  programming  comes  later,  from 
its  experience.  This  means  that  learning  machines  are  going  to  be  very 
unpredictable tools because of the very property they are used for: the
ability  to  do  more  than  has  been  explicitly  put  into  them  at  the  start. 

If  we  use  them  to  make  decisions,  the  good  to  the  user  of  their  de- 
cisions will  depend  on  the  experience  to  which  the  machine  has  been 
subjected  after  it  was  built.  That  this  is  not  trivial  is  obvious  from  the 
game  machines.  If  a  checker-playing  machine  can  defeat  the  man  who 
programmed  it  after  10  or  20  hours  of  programming,  it  is  quite  clear  that 
there  is  something  in  the  programming  that  was  not  explicitly  put  in  by 
the  programmer  and  that  he  does  not  even  know  explicitly.  This  means 
that  the  policy  of  the  machine  is  not  completely  predictable,  unless  one 
actually  does  the  work  of  the  learning  machine  and  learns  for  himself 
how  the  machine  is  likely  to  be  affected  by  its  experience. 

This  will  result  in  tremendous  risks  in  the  future.  Because  the  learning 
machine  is  not  completely  predictable,  it  is  quite  possible  that  it  will 
develop  policies  that  have  not  been  thought  of  before,  resulting  in  con- 
sequences that  have  not  been  considered  before  the  machine  was  used. 
There  is  no  reason  to  believe  that  the  new  values  the  machine  develops 
will be those we want the machine to have.

This  new  development  of  automation  is  one  that  needs  watching  all  the 
time.  Some  of  my  colleagues  say:  "But,  there  is  no  danger  in  it.  We  can 
put in fail-safe devices." That's a brilliant idea. Of course we'll have
to  put  in  fail-safe  devices  to  prevent  the  machine  from  doing  things  that 
are  risky.  But  if  we  use  the  machines  to  program  a  policy  —  very  often  a 
policy  to  be  used  very  immediately  to  decide  what  to  do,  such  as  an 
atomic  alarm  -  by  the  time  we  discover  that  the  decision  has  dangerous 
consequences,  it  may  be  too  late  to  make  it  fail-safe.  In  other  words,  the 
consequences  will  already  be  there  before  we  see  that  they  are  there. 
This  is  a  very  definite  possibility,  as  those  who  have  actually  worked 
with  these  devices  are  quite  aware. 

The  use  of  this  learning,  or  second-order  type  of  automatization,  where 
the  program  is  a  developing  one  and  only  general  policy  is  given  the 
machine,  is  likely  to  be  a  double-edged  sword.  We  are  likely  to  have 
before  us  the  problem  of  Goethe's  sorcerer's  apprentice.  Left  alone  by 
the magician and being decently lazy, the apprentice used the magician's
incantation to get the broom to fetch the water he was supposed to fetch.
Unfortunately, he hadn't learned the words to stop the broom, and the
machine went on fetching water until it was threatening to drown him.
Then the apprentice had the brilliant idea of the fail-safe device: he
broke the broom over his knee. The result, however, was that each half
of the broom went on carrying water. It had been programmed for that,
though the apprentice didn't realize this. Now the water was really
threatening to drown him, when the magician returned and stopped the
whole thing.

¹Ed. note: This Appendix is based on remarks made by Professor Wiener at
the symposium dinner.

The  point  I  am  making  is  that  in  the  social  use  of  these  new  learning 
devices  we  are  in  a  very  dangerous  situation.  Such  fairy  tales  as  the 
sorcerer's  apprentice  or  the  monkey's  paw  represent  actual  dangers.  We 
have  with  these  machines  a  very  dangerous  social  problem,  akin  to  the 
problem  of  slavery.  I  refer  here  not  to  slavery  as  being  cruel,  but  as  being 
inconsistent.  We  want  two  things  of  our  machines,  as  we  did  of  our 
slaves:  subservience  and  intelligence.  There  is  a  sort  of  dualism,  not 
unlike  the  quantum  dualism,  between  the  two;  the  more  we  get  of  one, 
the  less  we  get  of  the  other.  As  a  result,  we  will  have  to  give  up  one  or 
the  other  to  some  extent  —  we  will  have  to  compromise. 

That  this  danger  exists  is  realized  by  the  people  who  actually  work  on 
learning  machines.  But  it  is  a  danger  that  is  both  masked  and  intensified 
by  the  emotional  preferences  among  engineers  for  gadgets  over  human 
beings.  This  feeling  that  gadgets,  because  they  are  not  human,  are  more 
controllable  is  exceptionally  dangerous  at  the  present  time. 


APPENDIX  II:  FURTHER  REMARKS  ON  MULTIPLICATION  AS  AN 
INFORMATION PROCESS²

In  discussing  multiplication,  I  believe  the  whole  point  is  to  maintain 
identity  of  the  informational,  not  the  physical,  parts.  If  it  were  physical 
parts,  then  the  moment  a  gene  splits  in  two,  we  would  have  different 
physical  parts. 

If  we  start  with  a  given  machine,  make  a  functional  reproduction  of  it, 
and  reproduce  it  again  with  another  functional  reproduction,  the  second 
machine  may  be  quite  different  in  structure  from  the  first.  However,  the 
third  machine  may  be  the  same  as  the  first.  In  other  words,  in  two  stages, 
functional  reproduction  will  give  us  pictorial  reproduction.  However,  if 
we  consider  a  long  sequence,  we  lose  the  first  stage  —  there  is  no  first 
in  it. 


²Ed. note: This Appendix is based on comments made by Professor Wiener in
response to questions following his prepared talk.

Can the sort of theory of copying I have presented allow for variation?
I  think  it  can.  The  imitation  that  I  get  depends  on  the  number  of  terms  I 
take,  which  can  be  large  or  small,  and  on  the  time  it  takes  for  averaging, 
which  determines  the  precision  of  the  coefficients.  The  variation  would 
be  cumulative,  not  in  the  sense  of  being  in  one  direction  but  in  a 
drunkard's-walk sort of way. After all, when we imitate an imitation of an
imitation  of  an  imitation,  the  result  may  be  quite  different  from  what  we 
anticipated. 
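
The drunkard's-walk character of the variation can be checked with a small simulation. It assumes, purely for illustration, that each act of copying perturbs a single coefficient by an independent error of fixed size; the error size, the number of chains, and the random seed are hypothetical.

```python
import math
import random

def rms_drift(generations, copy_error=0.01, chains=2000, seed=0):
    """Root-mean-square departure from the original after repeated copying.

    Each generation adds an independent zero-mean error to the copied
    coefficient; the result is averaged over many independent chains.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(chains):
        deviation = 0.0   # difference between the copy and the original
        for _ in range(generations):
            deviation += rng.gauss(0.0, copy_error)
        total += deviation ** 2
    return math.sqrt(total / chains)

drift_25 = rms_drift(25)
drift_100 = rms_drift(100)
print("after 25 copies: ", drift_25)
print("after 100 copies:", drift_100)
```

Under these assumptions the drift grows roughly as the square root of the number of copies, so four times as many generations give about twice the departure from the original, with no preferred direction.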

I  wish  to  make  it  clear  that  the  techniques  I  have  described  have  only 
a  qualitative  relevance  to  the  actual  process  of  genetics.  There  are  many 
other  processes  that  can  do  this  sort  of  thing;  I  have  only  given  one 
process  that  we  may  carry  through  fairly  thoroughly.  Although  I  do  not 
want  it  to  be  taken  seriously  as  a  picture  of  how  inheritance  actually 
takes  place,  I  do  want  it  to  be  taken  seriously  that  there  is  no  intrinsic 
reason  why  machines  can't  have  a  heredity. 

I  would  like  to  point  out,  also,  that  I  have  deliberately  discussed  only 
the  one-input,  one-output  system  in  order  to  get  a  clean-cut  statement  of 
how  to  design  and  limit  my  problem.  However,  there  is  no  reason,  in 
principle,  why  the  results  are  not  applicable  to  the  many-input,  many- 
output  cases  and,  with  the  proper  sense  organs,  to  inputs  and  outputs  of 
a  different  nature,  e.g.,  light. 

The  theory  as  it  stands  assumes  reasonably  constant  conditions.  If  the 
parameters  of  the  black  box  were  changing  with  time,  we  would  not  have 
a  well-defined  nonlinear  transducer.  Changing  machines  are  very  tricky. 
In  Alice  in  Wonderland,  the  classical  source  to  which  all  mathematicians 
refer,  the  game  theory  of  the  croquet  game  is  very  much  disturbed  by  the 
flamingos  (who  are  the  mallets)  looking  Alice  in  the  face,  the  ball  hedge- 
hogs rolling  away,  and  the  rules  of  the  game  being  constantly  changed. 
This  applies  in  any  theory  and,  certainly,  in  a  theory  of  how  the  game, 
we'll  say,  could  multiply.  If  we  allow  random  changes  with  time,  we  must 
make  the  game  a  multiple-input  game  and  allow  these  changes  to  be  part 
of  the  input. 

Finally,  I  wish  to  point  out  that  my  results  have  been  presented 
from  communications,  not  energy,  considerations.  The  apparatus  I  have 
considered  is  not  conservative  in  energy.  For  example,  it  employs 
cathode  followers  for  amplification  from  the  outside  so  that  energy  con- 
siderations are  not  appropriate.  In  the  multiplication  of  animals,  energy 
considerations  are  also  not  appropriate.  The  animal  is  not  a  closed 
system.  It  is  eating  and  feeding,  the  genes  are  being  nourished  by  the 
nutrient  solution,  etc.  Therefore,  although  undoubtedly  there  are  ques- 
tions that  can  be  raised  by  energy  considerations,  they  will  have  to  be 
handled  most  carefully. 

APPENDIX III: ON THE FUNCTION OF SCIENCE IN SOCIETY³

During  the  summer  of  1960  I  spent  a  month  in  Russia  as  part  of  an 
eight-month  trip  abroad.  In  Russia  I  attended  the  Congress  on  Control 
and  Automatization,  a  very  successful  international  conference.  While  I 
was  there,  Russia's  chief  philosophical  journal,  Voprosii  Filosofii, 
asked  me  for  an  article,  which  I  wrote  and  which  they  accepted.  The 
purpose  of  this  article  was  to  express  my  sincere  opinion  about  certain 
difficulties  which  are  to  be  found  in  the  development  of  science  in 
Russia,  but  which  are  by  no  means  restricted  to  that  country. 

I  pointed  out  that  I  consider  the  function  of  science  in  society  to  be 
homeostatic.  Homeostasis  is  a  technical  term  of  the  physiologist  to 
describe  the  necessary  tendency  of  living  organisms  to  maintain  a  stable 
environment. For example, the range of body temperatures within which
human beings can remain alive at all is limited, from about 70 deg to
about 105 deg. Similarly, our pulse is regulated within a rather narrow
range of frequencies. From a great many
points  of  view,  we  can  also  describe  the  function  of  the  nervous  system 
to  be  homeostatic  —  that  is,  we  can  exist  in  a  large  number  of  environ- 
ments because  we  can  adjust  to  them. 

In  our  sense  organs  there  are  remarkable  homeostatic  mechanisms. 
For  example,  we  can  see  not  only  in  the  brightest  sunlight,  but  also,  to 
some  extent,  on  a  moonless  night  with  only  the  light  of  the  stars;  this 
represents  a  range  of  intensities  of  many  millions.  Here  we  have  homeo- 
static mechanism  piled  upon  homeostatic  mechanism.  We  think  of  the  eye 
primarily  in  terms  of  the  contraction  of  the  pupil,  the  pupil  being  closed 
in  bright  light  and  open  in  dim  light,  but  there  also  is  a  difference  in 
the  chemical  constituents  of  the  eye.  Certain  parts  of  this  organ  hardly 
work  at  all  in  bright  light  and  others  do  not  work  in  lower  intensities. 
Much  beyond  that,  we  have  a  regulation  in  the  nervous  system  of  the 
degree  of  amplification  that  a  stimulus  of  the  retina  can  cause  in  the 
nervous system. The nervous system involves not one but many amplifiers,
so that the intensity of the finally received impulse can vary over a wide
range in comparison with the impulse sent.

The  body  has  an  enormous  number  of  protective  mechanisms.  However, 
even  apart  from  the  internal  homeostasis  of  the  body  and  the  special 
homeostasis  of  the  nervous  system,  the  very  fact  that  sense  perception 
itself  is  homeostatic  allows  us  to  experience  the  environment,  to  com- 
pensate for  the  environment,  and  to  live  in  very  different  environments. 
Some of the body's homeostasis is automatic, such as when an animal
that is cold or in fear ruffles up its fur to get better protection against
the greater loss of heat resulting from increased metabolism. But the
mere perception of the world so that we can react to it is homeostatic.

³Ed. note: This Appendix contains remarks made by Professor Wiener to a
group of Purdue undergraduates during the symposium.

My  statement  to  the  Russians  was  that  the  function  of  science  in 
general  to  a  community  is  very  much  what  the  function  of  the  nervous 
system  is  to  man.  The  nervous  system  enables  man  to  have  sufficient 
knowledge  of  the  world  outside  himself  to  be  able  to  react  intelligently 
and  effectively  to  very  different  social  and  natural  environments.  Be- 
cause of  our  reactions  we  can  live  at  the  poles  or  in  the  tropics  and  can 
dress  ourselves  for  different  climates;  we  can  live  in  the  cities  as  well 
as  in  the  country,  the  city  being  the  most  biologically  abnormal  environ- 
ment. The  whole  regulation  of  intelligent  conduct  socially  and  politically 
is  very  largely  dependent  on  our  knowledge  of  the  environment,  in  other 
words,  on  scientific  knowledge. 

Scientific  knowledge  allows  us  to  react  effectively  to  a  wide  range  of 
situations  —  in  fact,  a  much  wider  range  than  we  are  actually  going  to 
face.  I  think,  however,  that  we  are  counting  too  confidently  on  having 
science  get  us  through  not  only  life's  normal  disturbances  and  changes 
of  environment  but  also  the  tremendous  changes  of  environment  that  we 
ourselves  are  producing.  Thus,  we  are  depending  on  the  homeostatic 
mechanism.  This  mechanism,  I  think,  is  likely  to  be  overstrained  and 
may  break  down  completely.  However,  until  it  does,  I  shall  speak  of  it  as 
I  would  of  the  homeostatic  mechanism  in  an  individual.  The  fact  that  an 
individual  finally  dies  does  not  prevent  the  maintenance  of  his  environ- 
ment while  he  is  living  being  important,  just  as  the  fact  that  society 
might  very  well  die  does  not  prevent  the  discussion  of  its  maintenance 
of  equilibrium  from  being  of  interest. 

If we limit our homeostasis to the environment, or to situations
that  we  know  will  take  place,  or  to  those  situations  predicted  by  a 
limited  social  theory  of  which  we  are  not  one  hundred  per  cent  sure,  then 
we  are  reducing  our  chances  of  reacting  correctly  to  the  most  dangerous 
situations  that  we  may  have  to  face.  A  true  homeostatic  mechanism  is  a 
purpose-seeking  mechanism  whose  actions  can  be  determined  safely  only 
as  the  mechanism  itself  develops.  I  pointed  out  to  the  Russians,  because 
I  thought  it  relevant  to  their  situation  and  not  irrelevant  to  ours,  the 
grave  dangers  that  can  result  from  taking  an  efficient  homeostatic  system 
and  deciding  too  narrowly  in  advance  what  situations  can  occur  and  what 
the  proper  adjustments  are  to  these  situations;  in  other  words,  to  deter- 
mine a  homeostasis  for  a  too  narrowly  defined  purpose  is  dangerous.  I  did 
not  say  anything  specific  about  Russia,  but  I  think  the  point  I  made  will 
be  understood  by  the  readers  of  the  article.  This  danger  of  having  too 
narrow  a  limitation  of  purpose  is  one  that  we  too  may  face.  If  we  are  to 
determine  that  anything  we  do  in  the  way  of  policy  must  fit  a  rigidly 
defined free enterprise system, we are likely to run into situations where
this  policy  simply  does  not  fit  and  where  we  are  in  a  very  tight  hole.  In 
many  ways,  the  population  problem,  the  transportation  problem,  or  the 
proper  use  of  automatic  machines  are  situations  where  too  rigid  a  view  of 
what  is  permissible  may  lead  to  disaster. 

I  am  quite  certain  that  this  has  happened  in  Russia  within  their  own 
interpretation of Marxism. Neither the rigidly defined free enterprise system
nor Marxism is new. In both cases, countries are trying to adapt
their  whole  life  in  a  modern  world  with  very  different  problems,  to  ideas 
formulated in the 1840's and 1850's. These ideas were formulated at the
time  of  the  first  industrial  revolution,  a  revolution  of  power.  Today  we 
face  a  very  different  situation,  a  second  industrial  revolution,  a  revolution 
of  control.  In  other  words,  these  ideas  are  not  so  much  wrong  as  they  are 
irrelevant  to  the  problems  of  the  day. 

What  I  have  said  has  a  definite  relevance  to  the  training  of  scientists. 
We want people who will be able to face as yet unknown situations, by as
yet  unknown  combinations  of  ideas  from  different  fields  of  work.  For 
this,  a  broad  basic  training  is  necessary.  So,  too,  are  crossing  the 
boundaries  of  scientific  specialization,  interdisciplinary  thinking,  and  a 
willingness  to  take  all  that  one  has  acquired  as  part  of  one's  available 
assets.  On  this  side  of  the  Iron  Curtain  a  present  difficulty  is  the  ten- 
dency to  parcel  out  science  in  small  pieces,  to  divide  scientists  into  two 
classes.  The  independent  people  are  being  given  the  information,  but  are 
mostly  doing  administration;  thus,  they  are  not  thinking  about  the  prob- 
lems. On  the  other  hand,  the  "stooges"  are  given  a  particular  piece  of  a 
piece  of  a  problem  to  do.  They  have  neither  the  training  required  nor 
access  to  information  about  related  pieces  of  the  problem  and  therefore 
cannot  do  efficient  work. 

I  believe  that  it  is  extremely  important  to  have  a  broad  basis  in  very 
different  sciences  for  one's  intellectual  work  so  that  one  can  follow  the 
problem  wherever  it  leads,  even  though  it  crosses  boundaries.  There 
should  not  be  a  customhouse  between  one  science  and  another  where  one 
must  pay  duty  when  going  from,  say,  physics  to  chemistry,  chemistry  to 
biology,  or  mathematics  to  realism.  We  must  keep  for  ourselves  and  for 
our  students  an  attitude  that  a  wide  interest  in  intellectual  matters  is 
desirable  and  necessary.  If  a  problem  leads  us  into  a  new  field  in  which 
we have no knowledge, we should acquire such knowledge. It is no excuse,
when working on a problem, to say "but that's not my field." At
some  stage  or  other  one  must  decide  to  learn  what  is  needed  about  the 
field;  those  who  do  not  are  "stooges"  who  are  not  serving  their  social 
function  for  science. 

In  this  sort  of  teaching  there  is  a  very  important  stage,  which  I  shall 
call  the  apprentice  stage.  In  taking  young  men  into  scientific  work,  it  is 
important not only to give them so-and-so's theorem and similar memory
work,  but  also  to  treat  them  from  the  beginning  as  junior  colleagues,  to 
find  out  ideas  with  them,  and  to  display  to  them  not  only  the  results  of 
the  work  but  how  it  feels  to  do  it.  They  should  be  taken  to  places  where 
one  gets  stuck,  become  accustomed  to  pulling  themselves  out  of  tight 
situations,  and  encouraged  in  their  general  interest  in  other  fields.  It  is 
a  most  responsible  job,  one  to  which  few  of  our  teachers  and  advanced 
students  are  devoted. 

When  I  train  a  young  scientist,  I  want  to  have  a  man  who  will  be  able 
to  help  me  in  the  future,  a  bright  junior  or  senior.  Therefore,  I  give  him  a 
part  of  the  program  I  am  doing;  it  may  be  a  small,  menial  part,  such  as  a 
computational  job.  I  have  been  doing  this  with  quite  a  few  young  men.  I 
set up the problem and have him compute it. When he is finished, he
tells  me  the  difficulties  he  has  run  into.  Thus,  I  have  not  really  given 
him  the  job  as  a  menial  job;  I  have  helped  him  understand  it  and  have 
explained  not  just  the  small  portion  with  which  he  is  concerned,  but 
what  I  am  doing  overall.  I  listen  to  his  suggestions  and  accept  him  as  a 
colleague  as  much  as  possible. 

I  consider  all  these  things  to  be  important,  and  pointed  them  out  in 
the  article.  I  did  not  cross  my  t's  and  dot  my  i's.  I  wanted  to  make  my 
Russian  readers  think  and  see  the  relevance  to  their  own  situation,  and 
I  am  sure  that  many  of  them  will. 
RICHARD BELLMAN


The  Basic  Concepts  of 
Dynamic  Programming 


The  concept  of  transformation,  embodying  as  it  does  the  idea  of 
changes  in  time  and  space,  is  a  basic  one  in  many  fields  of  study.  It  is 
particularly  penetrating  and  illuminating  in  the  mathematical  domain.  Add 
to this classical theme the modern notion of "decision" and we have the
basic  ingredients  of  a  new  field  of  mathematics,  the  theory  of  dynamic 
programming. 

There  is  much  to  challenge  the  mathematician,  with  many  problems  for 
all  tastes  ranging  from  the  most  applied  to  the  most  abstract  and  back 
again.  For  those  who  enjoy  various  degrees  of  contact  with  the  outside 
scientific  world  there  is  the  continuing  effort  involved  in  formulating  the 
basic  questions  of  science  into  the  language  of  mathematics.  Some  of  the 
questions  that  arise  in  fields  such  as  feedback  control  or  trajectory 
analysis,  chemical  process  control  or  neurophysiology,  scheduling  or 
inventory  theory,  quantum  mechanics  or  medical  diagnosis,  can  be  posed 
naturally  in  analytic  terms;  some  achieve  it  with  increasing  quantization 
of  the  field;  and  some  have  mathematical  formulation  thrust  rudely  upon 
them. 

Perhaps  one  of  the  most  beautiful  aspects  of  mathematical  analysis  is 
its  chameleon  quality.  It  is  not  necessary  for  a  physical  process  to  be  a 
multistage  decision  process  in  its  native  form,  for  it  to  be  treated  by  the 
techniques  of  dynamic  programming.  It  is  sufficient  for  it  to  be  posed  in 
equivalent  terms  to  permit  the  application  of  these  techniques. 

The  mathematician  is  repaid  for  his  bold  venture  into  the  outside  world 
by  the  discovery  of  new  types  of  functional  equations  which  please  his 
fancy  and  test  his  skill.  In  some  ways,  the  solutions  of  these  equations 
are  more  intricate  in  nature  than  those  of  the  classical  equations  of 
mathematical  physics.  Yet,  in  other  ways,  because  of  a  duality  natural  to 
decision processes, the behavior of the solution is more readily discovered
and comprehended. Once uncovered, this duality can be artificially
introduced into classical processes far removed from decision
making.

Particularly in connection with the numerical solution of these equations
do we face some formidable barriers. To render some of the problems
amenable to computational solution à la digital computer, we must
exercise ingenuity and, in addition, call upon all of the resources of
classical analysis.

It is absolutely essential to acknowledge the fact that all three aspects
of mathematical research (conceptual, analytic, and computational)
are intimately intertwined. The twin goals of structural analysis and
numerical solution must affect the original formulation. From the very
beginning, we observe that there are many questions that can be asked
concerning a particular physical process. Only if the appropriate questions
are asked will we be able to obtain appropriate answers. See Ref.
[1] for a further discussion of this point.

In the pages that follow we wish to present some of the basic concepts
of the theory of dynamic programming without involving ourselves in
analytic details. For these, and for applications and examples, the reader
may consult the books, Refs. [2-4]. Briefly, we wish to indicate how
some of these ideas may be applied to classical analysis (quasi-linearization),
and to mathematical physics (invariant imbedding).


MULTISTAGE  DECISION  PROCESSES 

A multistage decision process may be described simply in the following
abstract form. We begin with a set of quantities $p$, considered to be the
elements of a set or space $S$, and a set of transformations $T(p, q)$,
defined for $p \in S$ and for $q \in R$, another space, possessing the property that

$$p_1 = T(p, q) \in S \quad \text{if } p \in S \text{ and } q \in R \qquad (1)$$

We shall call $p$ the state variable, or vector, and $q$ the decision variable
or policy vector.

In  many  cases,  p  will  be  an  N-dimensional  vector  of  conventional 
form,  but  in  many  significant  situations  it  will  be  infinite-dimensional  or 
possess  components  that  are  not  merely  complex  numbers. 

If a succession of $q$'s is chosen, $q_1, q_2, \ldots, q_N$, a succession of
states $p_1, p_2, \ldots, p_N$ will be generated, starting with the initial state $p$,

$$p, \quad p_1 = T(p, q_1), \quad p_2 = T(p_1, q_2), \quad \ldots, \quad p_N = T(p_{N-1}, q_N) \qquad (2)$$
We have so far introduced state and policy vectors, and a set of transformations.
Let us now introduce the final member of the triumvirate that
makes up a multistage decision process, a criterion or return function.
We hypothesize the existence of a scalar function,

$$R = R(p, p_1, p_2, \ldots, p_N;\ q_1, q_2, \ldots, q_N) \qquad (3)$$

which evaluates the succession of states and decisions, a utility function.

Finally, we impose the condition that the $q_i$ are to be chosen to maximize
the value of this criterion function.

Although  it  is  worth  while  initially  to  formulate  the  process  in  this 
generality,  in  order  to  obtain  constructive  results  it  is  necessary  to 
endow  the  criterion  function  with  certain  special  properties.  Fortunately, 
it  is  not  necessary  to  improvise;  the  many  examples  of  these  processes 
in  the  physical  world  show  us  immediately  what  some  of  the  important 
structural  properties  of  R  must  be. 

As  written  in  Equation  (3),  past,  present,  and  future  are  inseparably 
entwined.  Let  us  now  take  advantage  of  the  fact  that  in  the  physical 
world  there  are  significant  differences  between  the  past  and  the  future, 
with  the  present  acting  as  the  transient  line  of  demarcation.  Let  us  then 
begin  with  a  criterion  function  of  the  form, 

$$R = g(p, q_1) + g(p_1, q_2) + \cdots + g(p_{N-1}, q_N) + h(p_N) \qquad (4)$$

This  assumption  of  additivity  of  utilities  from  particular  stages  of  the 
process,  while  apparently  quite  restrictive,  is  actually  broad  enough  to 
permit  us  to  handle  many  significant  classes  of  processes,  including 
such fields as the calculus of variations with its multitudinous incarnations,
and the general theory of feedback control; Refs. [2-4]. A discussion
of numerous activities in the economic, industrial, engineering,
and  military  spheres  will  be  found  in  these  references,  together  with 
analytic  and  computational  analyses. 
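
The ingredients introduced so far (a state, a transformation, and an additive criterion of the form (4)) can be made concrete with a small sketch. The particular T, g, and h below are hypothetical toy choices made only for illustration; they do not come from the text.

```python
def rollout_return(p, decisions, T, g, h):
    """Apply q_1, ..., q_N in order; return the visited states and the additive return (4)."""
    states = [p]
    total = 0
    for q in decisions:
        total += g(p, q)   # stage return g(p_{k-1}, q_k)
        p = T(p, q)        # next state p_k = T(p_{k-1}, q_k)
        states.append(p)
    return states, total + h(p)   # terminal term h(p_N)


# Hypothetical toy process (an assumption, not from the text):
# integer state, decisions chosen from {-1, 0, 1}.
def T(p, q):
    return p + q

def g(p, q):
    return -(p * p) - abs(q)

def h(p):
    return -10 * p * p


states, R = rollout_return(3, [-1, -1, -1], T, g, h)
print(states, R)
```

Evaluating one fixed sequence of decisions in this way is the building block; the sections that follow are concerned with choosing the sequence that maximizes R.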


POLICIES  AND  OPTIMAL  POLICIES 

Let a choice of the $q$'s, $[q_1, q_2, \ldots, q_N]$, be called a policy and a
choice which maximizes $R$ an optimal policy. To determine optimal policies
in an effective manner, we use the following intuitive principle:

Principle of Optimality. An optimal policy has the property that whatever
the initial state and initial decision are, the remaining decisions
must constitute an optimal policy with regard to the state resulting from
the first decision.
The functional equations of dynamic programming are derived by a
uniform and systematic application of the foregoing principle. Frequently,
a certain amount of analytic manipulation is required to prepare a process
for an efficient application, and occasionally a bit of legerdemain is
indulged in to determine the appropriate states, policies, and transformations.
A philosopher once said that mathematics is the science of tautology,
but this is most certainly an a posteriori, not an a priori, judgment.


A  BASIC  FUNCTIONAL  EQUATION 

To  derive  an  equation  that  will  simultaneously  permit  the  evaluation  of 
the  maximum  value  of  R  and  the  determination  of  the  optimal  policy,  we 
begin  with  the  surface  observation  that  this  maximum  value  depends  upon 
p,  the  initial  state,  and  N,  the  number  of  stages  in  the  process.  We 
therefore  write 

$$\max R = f_N(p) \qquad (5)$$

an  explicit  indication  of  the  functional  dependence. 

In  so  doing  we  are  emphasizing  the  fact  that  p  and  N  are  not  to  be 
regarded  as  fixed  constants,  but  rather  as  variable  parameters.  The 
original  process  with  specified  values  of  p  and  N  has  thus  been  imbedded 
within  a  family  of  similar  processes. 

In  so  doing  we  are  employing  a  powerful  intellectual  principle,  the 
principle of comparison. In such studies as comparative anatomy, comparative
religion, and so on, we see important examples of this. Rather
than absolutes, we have relatives. An individual situation or phenomenon
may often be best understood by being placed in perspective with comparable
situations or phenomena. We can understand an organism best by
knowledge  of  its  evolution  from  simpler  organisms. 

Returning from eternal verities to the function $f_N(p)$, we observe that
$f_1(p)$ is readily obtained:

$$f_1(p) = \max_{q_1} [g(p, q_1) + h(p_1)] \qquad (6)$$

where $p_1 = T(p, q_1)$.

If we can then somehow reduce the $N$-stage process to an $(N-1)$-stage
process, we have an inductive solution to our problem. To perform the
reduction from $N$ to $N-1$, observe what happens when an initial choice
of $q_1$ is made. The new state of the system is $p_1 = T(p, q_1)$, and we are
left with a multistage decision process in which the aim is to maximize
the quantity,

$$g(p_1, q_2) + g(p_2, q_3) + \cdots + g(p_{N-1}, q_N) + h(p_N) \qquad (7)$$

Hence, as a result of an initial choice of $q_1$, we see via the principle of
optimality that the return from the last $N-1$ stages is given by
$f_{N-1}(p_1) = f_{N-1}(T(p, q_1))$. The total return, therefore, from the $N$-stage
process as a result of an initial choice of $q_1$ will be

$$g(p, q_1) + f_{N-1}(T(p, q_1)) \qquad (8)$$

Since our aim is to maximize the total return from the $N$-stage process,
$q_1$ must be chosen to maximize this expression. Hence we obtain the
functional equation

$$f_N(p) = \max_{q_1} [g(p, q_1) + f_{N-1}(T(p, q_1))] \qquad (9)$$

On  the  basis  of  this  equation  we  can  analyze  the  analytic  structures  of 
the  maximum  return  and  optimal  policy  and  compute  numerical  solutions. 
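
As a minimal illustration of how the functional equation can be used computationally, the recursion may be sketched as follows. The finite decision set and the functions T, g, and h are hypothetical toy choices, and the convention f_0(p) = h(p) is an assumption adopted here so that the one-stage case reproduces Equation (6).

```python
from functools import lru_cache

# Hypothetical toy process (assumptions made for illustration only).
DECISIONS = (-1, 0, 1)

def T(p, q):
    return p + q

def g(p, q):
    return -(p * p) - abs(q)

def h(p):
    return -10 * p * p


@lru_cache(maxsize=None)
def f(n, p):
    """Maximum return from an n-stage process starting in state p."""
    if n == 0:
        return h(p)  # assumed convention: no stages left, only the terminal term
    # One-dimensional maximization over the first decision only.
    return max(g(p, q) + f(n - 1, T(p, q)) for q in DECISIONS)


print(f(3, 3))
```

The memoization reflects the imbedding idea: the values for every state and every smaller number of stages are computed once and reused, rather than recomputed for each candidate sequence of decisions.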


SINGLE  STAGE   VERSUS  MULTISTAGE  ASPECTS 

The method described above allows us to transform the original
$N$-dimensional maximization problem involving a choice of $[q_1, q_2, \ldots, q_N]$
into a succession of one-dimensional maximization problems involving
choices of $q_1$, then $q_2$, and so on. There are advantages and disadvantages
to each formulation, as is to be expected. There are no universal
solvents  in  chemistry  or  mathematics.  In  studying  a  particular  analytic  or 
physical  process,  both  approaches  should  be  held  in  readiness  and  often 
applied  in  conjunction.  Nothing  is  so  dangerous  to  scientific  progress  as 
a  dogmatism  which  banishes  all  but  one  approach. 

Far more important in many ways than analytic or computational advantages
of the new approach is the fundamental difference in concept.
The emphasis now is upon the policy, the choice of $q$ not as a vector
constant, but as a vector function, $q = q(p, N)$. The addition of the
policy function, determined as the maximizing value in Equation (9), adds
analytic maneuverability in approximating to the solutions of Equation
(9). It introduces a new analytic technique, "approximation in policy
space," and leads, as we shall discuss briefly below, to the technique
of quasi-linearization. As we shall see, it supplies an important duality
to the calculus of variations and it turns out to be a key to the modern
theory of feedback control, allowing us to formulate the more complex
stochastic and adaptive control processes in a unified fashion [3].

One  further  aspect  of  the  concept  of  a  policy  should  be  noted.  In 
treating  processes  in  these  terms,  we  reduce  a  multistage  decision 
process  to  its  essentials.  At  each  stage,  we  use  the  minimum  information 
required  for  decision  making,  and  make  the  minimum  number  of  decisions 
required  at  that  stage.  We  thus  retain  maximum  flexibility  for  future 
actions. 
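
The view of a policy as a function of the current state and the number of stages remaining can be illustrated with a small sketch, checked against direct enumeration of all decision sequences. The decision set and the functions T, g, and h are hypothetical toy choices, not taken from the text, and the zero-stage value h(p) is an assumed convention.

```python
from functools import lru_cache
from itertools import product

# Hypothetical toy process (assumptions made for illustration only).
DECISIONS = (-1, 0, 1)

def T(p, q):
    return p + q

def g(p, q):
    return -(p * p) - abs(q)

def h(p):
    return -10 * p * p


@lru_cache(maxsize=None)
def value_and_policy(n, p):
    """Return (f_n(p), q(p, n)): the maximum return and a maximizing first decision."""
    if n == 0:
        return h(p), None
    best_q = max(DECISIONS, key=lambda q: g(p, q) + value_and_policy(n - 1, T(p, q))[0])
    return g(p, best_q) + value_and_policy(n - 1, T(p, best_q))[0], best_q


def run_policy(p, n):
    """Generate decisions one stage at a time, using only the current state and stages left."""
    decisions = []
    while n > 0:
        q = value_and_policy(n, p)[1]
        decisions.append(q)
        p, n = T(p, q), n - 1
    return decisions


def brute_force(p, n):
    """N-dimensional maximization over every sequence [q_1, ..., q_N], for comparison."""
    def total_return(qs):
        total, s = 0, p
        for q in qs:
            total += g(s, q)
            s = T(s, q)
        return total + h(s)
    return max(total_return(qs) for qs in product(DECISIONS, repeat=n))


print(run_policy(3, 3), value_and_policy(3, 3)[0], brute_force(3, 3))
```

In this sketch the enumeration examines 3^N sequences, while the policy formulation needs at each stage only the present state and the number of stages remaining.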

These considerations are important in the design of computers, automata,
and servomechanisms, and in neurophysiological studies.

POLICIES  WITHOUT  CRITERIA 

Once  the  focus  of  our  attention  has  been  shifted  from  criterion  functions 
to policies, we are in a position to study many important classes of processes
for which no criterion function exists. The importance of this in
theory and application can scarcely be overestimated, since many multistage
decision processes occurring in the business, economic, biological,
and military realms do not possess explicit criteria.

Despite  this  apparently  paralyzing  fact,  it  is  quite  sensible  to  explore 
the  consequences  of  particular  policies.  A  policy  is  now  a  rule  for 
making  a  decision,  or  action,  in  terms  of  the  information,  or  stimuli, 
available. 

All  of  this  leads  us  naturally  into  the  study  of  simulation  processes,  a 
complex  and  intriguing  field.  For  a  discussion  of  these  matters,  see 
Ref.  [5],  where  further  references  may  be  found. 

ANALYTIC  AND  COMPUTATIONAL  RESULTS 

A  new  functional  equation  in  analysis  carries  with  it  the  standard 
questions  of  existence,  uniqueness,  and  analytic  and  computational 
solution. 

I. Does a solution of

$$f_N(p) = \max_{q_1} [g(p, q_1) + f_{N-1}(T(p, q_1))]$$

or of the associated approximate equation,

$$f(p) = \max_{q_1} [g(p, q_1) + f(T(p, q_1))]$$

exist, and is it unique?



The  question  is  quite  important  since  we  must  establish  a  one-to-one 
correspondence  between  a  specified  subset  of  the  solutions  of  these 
equations  and  the  maximum  return  and  optimal  policy  of  the  original 
multistage  decision  process,  in  order  to  use  the  equations  to  analyze 
the  process. 

II.  Can  we  obtain  analytic  results  concerning  the  form  of  the  maximum 
return  and  the  structure  of  optimal  policies? 

This  is  an  important  line  of  investigation  since  we  wish  to  use  the 
solutions of simple problems as approximations to the solutions of complex
processes. Consequently, our aim is to collect a library of decision
processes  with  simple  intuitive  policies  to  use  as  jumping-off  places  for 
the  study  of  more  realistic,  complex  processes. 

III. Can we use the foregoing equations to obtain numerical determination
of the maximum return and optimal policy, using digital computers?

There are many challenging difficulties encountered in the computational
solution of equations of the foregoing type. If one formulation of a
problem cannot be adequately treated, then the problem is that of reformulating
the problem so that it can be treated numerically.

The  three  foregoing  questions  open  up  many  new  fields  of  mathematical 
research.  Discussions  of  some  of  the  results  already  known  and  of  some 
of  the  outstanding  problems  will  be  found  in  Refs.  [2-4]. 


CONTINUOUS  MULTISTAGE  DECISION  PROCESSES 

A new category of interesting problems arises in the study of multistage
decision processes of continuous type. The simplest examples of
these processes arise in the calculus of variations. Consider, for example,
the maximization of the functional,

$$J(u) = \int_0^T g(u, u')\,dt \qquad (10)$$

over all functions $u(t)$, $0 \le t \le T$, satisfying the initial condition $u(0) = c$.
The usual approach regards the unknown function $u(t)$ as a point in
function space and uses the standard variational technique to derive the
Euler equation,

$$\frac{\partial g}{\partial u} - \frac{d}{dt}\left(\frac{\partial g}{\partial u'}\right) = 0 \qquad (11)$$

This second-order differential equation must be solved subject to the
initial condition above, $u(0) = c$, and the new condition derived from
variational considerations,

$$\left.\frac{\partial g}{\partial u'}\right|_{t = T} = 0 \qquad (12)$$


There  are  many  difficulties  associated  with  this  procedure  which 
render  it  unsatisfactory  in  many  important  processes;  see  Refs.  [3,  4]  for 
a  detailed  discussion.  In  the  next  section  we  shall  present  the  dynamic 
programming  approach  to  the  calculus  of  variations. 

DYNAMIC  PROGRAMMING  APPROACH 

As pointed out above, the conventional approach of the calculus of
variations regards the unknown function $u(t)$ as a point in a suitable
function space and perturbs about this "point" to obtain the Euler
equation (Figure 1). Taking cognizance of the classical duality principle
of geometry whereby we can regard a curve as a locus of points,
or as an envelope of tangents (Figure 2), we see that we can regard the
variational problem as that of determining an optimal direction at points
along the curve. We thus have a multistage decision process of continuous
type in which we wish to determine $u'$, the decision variable, as a
function of the state variable $u$, and the duration of the process $T - t$.

Figure 1. Locus of points.

Figure 2. Envelope of tangents.


FUNCTIONAL  EQUATION 

Let us write

$$f(c, T) = \max_u \int_0^T g(u, u')\,dt \qquad (13)$$

Using the inherently additive aspect of the integral,

$$\int_0^T = \int_0^\Delta + \int_\Delta^T \qquad (14)$$

we obtain from the Principle of Optimality the functional equation,

$$f(c, T) = \max_{u'(0)} [g(c, u'(0))\Delta + f(c + \Delta u'(0), T - \Delta)] + o(\Delta) \qquad (15)$$

for small $\Delta$, where $u'(0)$ is the initial slope at $t = 0$, a function of $c$ and
$T$. Passing to the limit as $\Delta \to 0$, we derive the nonlinear partial differential
equation

$$\frac{\partial f}{\partial T} = \max_v \left[ g(c, v) + v \frac{\partial f}{\partial c} \right] \qquad (16)$$

($v = u'(0)$), with the initial condition $f(c, 0) = 0$.

From this equation it is easy to derive the Euler equation in a number
of ways; see Refs. [2, 3]. This method may be extended to cover variational
problems of any order and multidimensional variational processes.
Finally, let us point out that this approach affords a simple derivation of
Hamilton-Jacobi theory and of other variational principles of mechanics;
see Refs. [6-8].
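
One such route from Equation (16) to the Euler equation can be sketched as follows. The sketch assumes, beyond what is stated in the text, that $f$ is twice continuously differentiable and that the maximum in (16) is attained at an interior point where the derivative with respect to $v$ vanishes.

```latex
% Stationarity of the bracket in (16) with respect to v:
\frac{\partial g}{\partial v}(c, v) + \frac{\partial f}{\partial c} = 0 .

% Differentiate (16) with respect to c. By the stationarity condition,
% the terms containing dv/dc cancel:
\frac{\partial^2 f}{\partial c \, \partial T}
  = \frac{\partial g}{\partial c} + v \, \frac{\partial^2 f}{\partial c^2} .

% Along an optimal curve the state is u(t), the remaining duration is T - t,
% and the decision is v = u'(t). Hence
\frac{d}{dt} \, \frac{\partial f}{\partial c}\bigl(u(t), \, T - t\bigr)
  = u' \, \frac{\partial^2 f}{\partial c^2}
    - \frac{\partial^2 f}{\partial c \, \partial T}
  = - \frac{\partial g}{\partial u} .

% Substituting f_c = - g_{u'} from the stationarity condition:
\frac{\partial g}{\partial u} - \frac{d}{dt} \, \frac{\partial g}{\partial u'} = 0 .
```

Under the same assumptions, the initial condition $f(c, 0) = 0$ gives $\partial f / \partial c = 0$ when no time remains, and the stationarity condition then yields $\partial g / \partial u' = 0$ at $t = T$, in agreement with condition (12).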


STOCHASTIC  AND  ADAPTIVE  PROCESSES 

We do not wish to enter here into the details of the more recent applications
of dynamic programming to the study of stochastic and adaptive
processes. Let us merely point out that the same general techniques are
applicable, at the expense of replacing the transformation $T(p, q)$ by a
stochastic transformation $T(p, q, r)$ and the previous criterion function
by one involving expected values. The interested reader may consult
Refs. [2, 3].


QUASI-LINEARIZATION

Consider  the  Riccati  equation, 

$$u' = u^2 + a(t), \qquad u(0) = c \qquad (17)$$

which arises in many analytic and physical contexts. Let us use the
well-known and simple identity

$$u^2 = \max_v \,(2uv - v^2) \qquad (18)$$

Then Equation (17) may be written,

$$u' = \max_v \,[2uv - v^2 + a(t)], \qquad u(0) = c \qquad (19)$$


We have thus associated the Riccati equation with a continuous multistage
decision process, a process determined by Equation (19). Using
the  concept  of  approximation  in  policy  space,  we  can  approach  the  study 
of  Equation  (17)  in  several  new  ways;  see  Refs.  [9,  10]. 
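
One way to read "approximation in policy space" here is as an iteration: fix a policy $v(t)$, solve the resulting linear equation $u' = 2uv - v^2 + a(t)$, then take the new $u$ as the next policy. The sketch below is only an illustration under assumptions not in the text: the choice $a(t) = 1$, $c = 0$ (for which the exact solution is $\tan t$), the interval $[0, 1]$, and a plain forward-Euler integrator chosen for simplicity.

```python
import math


def quasilinearize(a, c, t_end=1.0, steps=1000, iterations=10):
    """Successive approximations u_0, u_1, ... to u' = u^2 + a(t), u(0) = c.

    Each pass solves the LINEAR equation u' = 2*u*v - v^2 + a(t) with the
    policy v taken to be the previous iterate (forward Euler, an assumed choice).
    """
    h = t_end / steps
    u = [c] * (steps + 1)        # initial guess u_0(t) = c
    history = [u]
    for _ in range(iterations):
        v = u                    # policy for this pass: v_n = u_n
        new = [c]
        for k in range(steps):
            slope = 2.0 * new[k] * v[k] - v[k] ** 2 + a(k * h)
            new.append(new[k] + h * slope)
        u = new
        history.append(u)
    return history


history = quasilinearize(lambda t: 1.0, 0.0)
for n in (1, 2, 3, 4, 10):
    print(n, history[n][-1])     # value of the n-th iterate at t = 1
print("tan(1) =", math.tan(1.0))
```

Each pass requires only the solution of a linear differential equation, which is the practical attraction of the reformulation (19); the remaining gap between the final iterate and $\tan 1$ in this sketch is the discretization error of the Euler scheme, not of the iteration.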

It  turns  out  that  many  of  the  important  equations  of  mathematical 
physics  can  be  transformed  in  a  similar  fashion,  and  that  in  this  way 
new  analytic  and  computational  results  can  be  obtained;  see  Refs.  [9, 10]. 


INVARIANT  IMBEDDING 

Many  of  the  basic  processes  of  mathematical  physics  can  be  viewed 
as  multistage  processes.  It  follows  that  a  functional  equation  approach 
similar  to  that  presented  above  for  decision  processes  should  yield  new 
and significant results. This approach was foreshadowed by the "principle
of invariance" of Ambarzumian and Chandrasekhar in radiative
transfer  [11],  and  extended  and  generalized  as  part  of  a  general  approach 
called  invariant  imbedding;  see  Refs.  [12-15]  for  applications  to  neutron 
transport  theory,  radiative  transfer,  scattering,  heat  conduction,  and 
wave  propagation. 
CONCLUSION

A  new  theory  inevitably  introduces  new  problems  along  with  solutions 
to  old  problems.  New  problems  and  new  theories  call  for  new  minds, 
generally  young  minds.  Since  the  problems  are  so  different  in  many  ways 
from  the  classical  ones,  there  is  no  guarantee  that  classical  erudition 
will  help  in  their  solution.  There  is  even  a  strong  possibility  that  the 
classical  directions  may  not  be  the  correct  ones  and  that  entirely  new 
routes  are  required.  Consequently,  too  strong  a  training  in  the  methods 
of  the  past  may  inhibit  the  untrammeled  play  of  the  imagination  which  is 
so  necessary  for  the  development  of  new  fields.  Put  all  of  this  together 
and  it  means  fascinating  opportunities  for  the  young  mathematician. 
Axioms for Persistent Preference*


Consumers,  business  firms,  and  governments  make  many  economic 
decisions  from  which  they  expect  to  accrue  benefits  over  long  future 
periods.  It  makes  sense,  for  normative  and  perhaps  also  for  descriptive 
analysis,  to  represent  such  choices  in  terms  of  maximizing  a  utility 
function  under  constraints.  One  may  take  as  the  variables  for  such  a 
function  the  quantities  of  all  goods  available  for  consumption  during 
each  of  a  sequence  of  future  periods.  To  simplify  the  analysis  so  that 
the  quantity  of  capital  goods  remaining  at  the  end  of  the  last  period 
need  not  be  considered,  it  is  natural  to  make  utility  depend  on  an  infinite 
sequence  of  consumption  vectors.  Such  a  sequence  is  called  a  program. 

The  present  paper  studies  the  implications  of  five  axioms  that  may  be 
imposed on such a utility function: (1) continuity, (2) sensitivity, (3) restricted
independence, (4) stationarity, and (5) existence of extreme
programs.  Although  these  axioms  do  not  appear  to  make  any  distinction 
between  different  periods  in  the  future,  it  is  found  that  they  imply  a 
preference  for  benefits  in  the  immediate  future  over  equal  benefits  more 
distant  in  time  (Implications  3,  4,  and  5  below).  Moreover,  although  the 
postulates  speak  only  of  an  ordering  of  programs  by  a  utility  function 
defined  up  to  an  increasing  continuous  transformation,  the  axioms  make 
it  possible  to  specify  a  class  of  transformations  that  give  to  the  utility 
function  a  quasi-cardinal  property  (Implication  4). 


*Ed.  Note:  This  paper  was  presented  by  Koopmans.  Both  Koopmans  and 
Williamson  were  visiting  at  Harvard  at  the  time  of  the  symposium.  At  the  request 
of  Koopmans,  only  a  summary  is  presented  here;  for  more  detailed  treatment, 
see Ref. [1].
NOTATION


$x$, $x_t$ are vectors representing the amounts of $n$ goods available for
consumption in some period, or in period $t$, respectively, where $t = 1, 2, \ldots$.

$X$ is the set of all possible consumption vectors $x$ or $x_t$. This set is
independent of $t$, and is a subset of an $n$-dimensional vector space.

${}_t x = (x_t, x_{t+1}, \ldots)$ is an infinite sequence of consumption vectors
for the periods $t, t+1, \ldots$. In particular,
${}_1 x = (x_1, {}_2 x) = (x_1, x_2, x_3, \ldots)$ is called a program.

$U({}_1 x)$ is a utility function defining a preference ordering of programs
in the sense that $U({}_1 x) >, =, < U({}_1 x')$ indicates that ${}_1 x$ is
preferred, equivalent, or inferior to ${}_1 x'$, respectively.

$I = [0, 1]$ is the unit interval $0 \le U \le 1$.



AXIOMS 


1. Continuity: There exists a utility function $U({}_1 x)$ defined for $x_t \in X$
for all $t$, continuous for the norm $\sup_t |x_t - x_t'|$, and uniformly continuous
at the points ${}_1 x$ of any set on which $U({}_1 x)$ is constant. The
consumption set $X$ is a convex, bounded subset of $R^n$.


In this axiom the choice of norm makes the "distance" between two
programs equal to the maximum "distance" between the two vectors for
any period, thus treating all parts of the future alike. The uniform continuity
requirement expresses that, for any two equivalent programs, the
utility of one program cannot be infinitely more sensitive to changes in
any quantity consumed than is the utility of the other program.

2. Sensitivity: There exist vectors $x_1$, $x_1'$ and a sequence ${}_2 x$ such that

$$U(x_1, {}_2 x) > U(x_1', {}_2 x)$$

This axiom says that utility can be affected by changes in the first-
period  consumption  alone.  Taken  together  with  Axiom  4  this  implies  that 
utility  can  be  affected  by  change  in  the  consumption  of  any  single  future 
period. 

3. Restricted Independence: For all $x_1$, $x_1'$, ${}_2 x$, ${}_2 x'$,

if $U(x_1, {}_2 x) \ge U(x_1', {}_2 x)$, then $U(x_1, {}_2 x') \ge U(x_1', {}_2 x')$;
if $U(x_1, {}_2 x) \ge U(x_1, {}_2 x')$, then $U(x_1', {}_2 x) \ge U(x_1', {}_2 x')$.

This axiom implies that, if a given first-period consumption ($x_1$) is
preferred to another ($x_1'$) in relation to a given sequel (${}_2 x$) of the consumption
program, it is also preferred in relation to any other sequel
(${}_2 x'$). This expresses what the economist calls "noncomplementarity"
between the consumption programs for period 1 and the rest of the future.

4. Stationarity (or Persistent Preference): For some $x_1$ and all ${}_2 x$, ${}_2 x'$,

$$U(x_1, {}_2 x) \ge U(x_1, {}_2 x') \ \text{ if and only if } \ U({}_2 x) \ge U({}_2 x')$$

This  axiom  expresses  that  preference  is  expected  to  be  independent  of 
calendar  time. 

5. Extreme programs: There exist ${}_1 \underline{x}$, ${}_1 \bar{x}$ such that

$$U({}_1 \underline{x}) \le U({}_1 x) \le U({}_1 \bar{x}) \ \text{ for all } {}_1 x$$

This  axiom  is  included  to  avoid  certain  technical  complications  in  the 
reasoning.  If  not  initially  true,  it  can  be  made  true  by  a  decision  to  limit 
the discussion to a set of programs ranked between two designated constant
programs.


IMPLICATIONS OF THE AXIOMS¹

¹Implications 1, 2, 3, and part of 5 are proved in Ref. [1]; proofs for Implication
4 and the remainder of 5 will be given in a forthcoming publication.


1. The ordering of programs is not affected by an increasing transformation
$U^*({}_1 x) = \phi(U({}_1 x))$. This can be used to make
$U({}_1 \underline{x}) = 0 \le U({}_1 x) \le 1 = U({}_1 \bar{x})$ (omitting *'s).

2. There exist

•  (one-period extreme) vectors $\underline{x}, \bar{x} \in X$,

•  a continuous (one-period utility) function $u(x)$, from $X$ onto $I$,

•  a continuous function $V(u, U)$ from $I \times I$ onto $I$, increasing in
each variable,

such that

(a) $u(\underline{x}) = 0 \le u(x) \le 1 = u(\bar{x})$ for all $x \in X$,

(b) $U({}_1 x) = V(u(x_1), U({}_2 x))$ for all ${}_1 x = (x_1, {}_2 x)$,

(c) $V(0, 0) = 0$, $V(1, 1) = 1$.

It follows from (b) that the ordering of two programs is known if, for some
$T$, for either program the following values of $u(x)$ and $U({}_t x)$ are known:

•  $(u(x_1), u(x_2), \ldots, u(x_T)) = (u_1, u_2, \ldots, u_T) = {}_1 u_T$, say,

•  $U({}_{T+1} x) = U_{T+1}$, say.

To compare the two programs, evaluate

•  $U({}_1 x) = V(u_1, V(u_2, \ldots, V(u_T, U_{T+1}) \ldots)) \equiv V_T({}_1 u_T, U_{T+1})$,
say, for either program.

3.  For each $T \ge 1$, there exists a continuous function $W_T({}_1u_T)$ from $I^T$
onto $I$, increasing in each $u_t$, $t = 1, \ldots, T$, such that

if $U \gtreqless W_T({}_1u_T)$, then $U \gtreqless V_T({}_1u_T, U) \gtreqless W_T({}_1u_T)$

$W_T({}_1u_T)$ is the utility level of a program whose ordering is not affected
by a postponement by $T$ periods if consumption vectors of one-period
utilities $u_1, u_2, \ldots, u_T$ are inserted in the periods vacated. The above
inequalities imply that if a program with a utility level in excess of
$W_T({}_1u_T)$ is postponed by $T$ periods, in which one-period utilities $u_1, \ldots,
u_T$ are inserted, then the new utility level will still be in excess of, but
closer to, $W_T({}_1u_T)$ than the old one. These two statements taken together
give the first indication, in an ordinal manner, that changes in some
programs affect utility less if made in a more distant future than if made
in the present or in the near future.

$W_T({}_1u_T)$ is also the utility of a program in which a finite sequence of
vectors $x_1, \ldots, x_T$ such that $u(x_t) = u_t$ is repeated infinitely often. One
can choose a continuous increasing transformation of $u(x)$, from $I$ onto $I$,
which makes $W_1(u) = u$, hence $V(U, U) = U$. An example with $T = 3$ in
which this transformation has been made is shown in Figure 1.
Figure 1.  Illustration of implication 3.


4.  There exists a continuous increasing transformation $\phi$, $I$ onto $I$,

$U^*({}_1x) = \phi(U({}_1x))$,  $V^*(u, U^*) = \phi(V(u, \phi^{-1}(U^*)))$

such that (omitting *'s)

if $U < U'$, then $0 < V(u, U') - V(u, U) < U' - U$ for all $u$

This transformation confers upon the initially ordinal utility $U({}_1x)$ a
quasi-cardinal character. Certain utility differences are being compared,
and a time perspective is found to affect utility comparisons: given
program differences more distant in time entail smaller utility differences.

5.  If

${}_1u_T$,  ${}_{T+1}u_{T+S}$,  $\underline{U}$,  $\bar{U}$,  $U$

are such that

$W_T({}_1u_T) > W_S({}_{T+1}u_{T+S})$
$\underline{U} = \max\{\,0,\ \text{solution of } V_T({}_1u_T, U) = W_S({}_{T+1}u_{T+S})\,\}$
$\bar{U} = \min\{\,1,\ \text{solution of } V_S({}_{T+1}u_{T+S}, U) = W_T({}_1u_T)\,\}$
$\underline{U} < U < \bar{U}$

then

$V_T({}_1u_T, V_S({}_{T+1}u_{T+S}, U)) > V_S({}_{T+1}u_{T+S}, V_T({}_1u_T, U))$

This property has been called impatience. It says that if one program is
obtained from another by interchange of two finite vector sequences,
then the program having the "better" sequence first is preferred.

An example for $T = S = 1$, so that ${}_1u_T = u_1$, ${}_{T+1}u_{T+S} = u_2$, $u_1 > u_2$, is
shown in Figure 2.
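
As a purely illustrative check of implications 4 and 5, the sketch below assumes a linear aggregator V(u, U) = (1 - α)u + αU with 0 < α < 1. This particular form of V, the constant `ALPHA`, and the helper `V_T` are assumptions introduced here for illustration; the text only requires V to be continuous and increasing in each variable. Under that assumption, placing the better one-period utility first gives the higher program utility.

```python
from fractions import Fraction

ALPHA = Fraction(9, 10)  # hypothetical weight on the future, 0 < ALPHA < 1


def V(u, U):
    """Assumed aggregator: utility of (x_1, 2x) from u = u(x_1) and U = U(2x)."""
    return (1 - ALPHA) * u + ALPHA * U


def V_T(us, U):
    """Nested aggregation V(u_1, V(u_2, ..., V(u_T, U)...))."""
    for u in reversed(us):
        U = V(u, U)
    return U


# T = S = 1 with u1 > u2, as in the example of Figure 2.
u1, u2, U = Fraction(4, 5), Fraction(1, 5), Fraction(1, 2)
better_first = V_T([u1, u2], U)   # better period first
worse_first = V_T([u2, u1], U)    # the two periods interchanged
print(better_first, worse_first)
```

For this assumed V the difference between the two orderings is (1 - α)²(u1 - u2), which is positive whenever u1 > u2.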


Figure 2.  Example of impatience; T = S = 1.
SIGEITI    MORIGUTI*


Further  Results  in  the 
Theory  of  Numerical 
Convergence 


The  purpose  of  this  paper  is,  first,  to  summarize  the  results  of  pre- 
vious papers  [1,  2]  in  an  expository  manner,  and  second,  to  present  an 
extension  toward  random  rounding. 

My  interest  in  numerical  convergence  arose  from  the  following 
observation. 

Suppose we solve the simultaneous linear equations

$7x + y + 2z = 10$
$x + 8y + 3z = 8$        (1)
$2x + 3y + 9z = 6$

by iteration, using the numerical scheme

$x_{n+1} \doteq (10 - y_n - 2z_n)/7$
$y_{n+1} \doteq (8 - x_n - 3z_n)/8$        (2)
$z_{n+1} \doteq (6 - 2x_n - 3y_n)/9$
$(n = 0, 1, 2, \ldots)$

starting with the initial values $x_0 = y_0 = z_0 = 0$. The dot on the equality
symbol means equality after rounding. If we round each result to the
third decimal place, we obtain the sequence shown in Table 1.


*Research sponsored by the Office of Naval Research under Contract No.
Nonr-266(33),  Project  No.  Nr  042-034.  The  author  is  indebted  to  Professor 
Herbert  Robbins  for  his  valuable  suggestion  about  random  rounding,  which  led 
to  this  work. 
TABLE 1
Successive Approximations

  n      x_n      y_n      z_n
  0      0        0        0
  1      1.429    1.000    0.667
  2      1.095    0.571    0.016
  3      1.342    0.857    0.233
  4      1.240    0.745    0.083
  5      1.298    0.814    0.143
  6      1.271    0.784    0.107
  7      1.286    0.801    0.123
  8      1.279    0.793    0.114
  9      1.283    0.797    0.118
 10      1.281    0.795    0.116
 11      1.282    0.796    0.117
 12      1.281    0.796    0.116
 13      1.282    0.796    0.117
 14      1.281    0.796    0.116
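
The iteration of scheme (2) with rounding to three decimals can be reproduced with a short sketch. To avoid binary floating-point effects, the sketch stores each value as an integer number of thousandths and uses exact fractions; the function names `round_nearest`, `jacobi_step`, and `iterate` are illustrative choices, not names from the paper.

```python
from fractions import Fraction
from math import floor


def round_nearest(q):
    """Round an exact fraction to the nearest integer (halves go up)."""
    return floor(q + Fraction(1, 2))


def jacobi_step(state):
    """One step of scheme (2); state holds (x, y, z) in units of 0.001."""
    x, y, z = state
    return (
        round_nearest(Fraction(10000 - y - 2 * z, 7)),
        round_nearest(Fraction(8000 - x - 3 * z, 8)),
        round_nearest(Fraction(6000 - 2 * x - 3 * y, 9)),
    )


def iterate(state, steps):
    """Return the list of states for n = 0, 1, ..., steps."""
    rows = [state]
    for _ in range(steps):
        state = jacobi_step(state)
        rows.append(state)
    return rows


rows = iterate((0, 0, 0), 14)
for n, (x, y, z) in enumerate(rows):
    print(f"{n:2d}  {x / 1000:.3f}  {y / 1000:.3f}  {z / 1000:.3f}")
```

From n = 12 onward the printed rows alternate between two states, which is the terminal oscillation discussed in the text.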

The  behavior  of  this  process  in  the  earlier  stages  can  be  well  under- 
stood by  the  theory  of  difference  equations;  the  eigenvalue  with  the 
largest  absolute  value  is  -0.504  in  this  case.  But  the  oscillation  in  the 
final  stages  (xn  between  1.282  and  1.281,  zn  between  0.117  and  0.116) 
can  never  be  understood  unless  we  take  account  of  the  effect  of  rounding. 
This  effect  appears  when  the  process  comes  close  to  the  true  solution, 
which  in  this  case  is 

x  =  132/103  =  1.28155 

y  =     82/103   =  0.79612  (3) 

z  =     12/103   =  0.11650 

One naturally gets the idea of a change of variables:

$x = 1.281 + 0.001u$
$y = 0.796 + 0.001v$        (4)
$z = 0.116 + 0.001w$

which will reduce Equation (2) to

$u_{n+1} \doteq (5 - v_n - 2w_n)/7$
$v_{n+1} \doteq (3 - u_n - 3w_n)/8$        (5)
$w_{n+1} \doteq (6 - 2u_n - 3v_n)/9$

Here the rounding of $u_{n+1}$, etc., refers to rounding to the nearest integer.
Hence $u_n = v_n = w_n = 0$ (corresponding to n = 12 or 14) leads to
$u_{n+1} \doteq 5/7 \doteq 1$, $v_{n+1} \doteq 3/8 \doteq 0$, $w_{n+1} \doteq 6/9 \doteq 1$. Likewise, $u_n = 1$, $v_n = 0$, $w_n = 1$
leads to $u_{n+1} \doteq 3/7 \doteq 0$, $v_{n+1} \doteq -1/8 \doteq 0$, $w_{n+1} \doteq 4/9 \doteq 0$. Thus the
process oscillates between (0,0,0) and (1,0,1). On the other hand, (0,0,1)
will lead to (0,0,1), and (1,0,0) to (1,0,0), thus providing two different
stationary states. Similar phenomena can also occur in the Gauss-Seidel
process.

Another interesting example of terminal oscillation caused by rounding
is the following. In Newton's iteration for the square root of $a$,

$x_{n+1} \doteq \frac{1}{2}\left(x_n + \frac{a}{x_n}\right)$,   $n = 0, 1, 2, \ldots$        (6)

if we use five-decimal-place arithmetic and if our rounding rule is a
simple truncation ("chopping-off"), then for $a = 0.24999$ the process will
oscillate indefinitely between $x = 0.49998$ and $x = 0.49999$.
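
The oscillation in Newton's iteration can be checked with integer arithmetic. The sketch below assumes that every intermediate result (the quotient a/x and the final halving) is chopped to five decimals, and that the starting value is 0.50000; both are assumptions made here, since the text does not state them. Values are stored as integers in units of 0.00001.

```python
SCALE = 10**5  # five decimal places; a value v is stored as round(v * SCALE)


def newton_step(a, x):
    """One step of (6), chopping each intermediate result to five decimals."""
    quotient = (a * SCALE) // x   # a / x, truncated (all values are positive)
    return (x + quotient) // 2    # (x + a/x) / 2, truncated


a = 24999   # a = 0.24999
x = 50000   # hypothetical starting value x_0 = 0.50000
history = [x]
for _ in range(6):
    x = newton_step(a, x)
    history.append(x)
print([h / SCALE for h in history])
```

After the first step the sequence alternates between 0.49999 and 0.49998 and never settles.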

This  kind  of  terminal  oscillation  is  not  a  trouble  at  all  for  a  human 
computer,  because  the  repetitive  pattern  can  be  recognized  quite  easily. 
But  it  is  a  nuisance  for  an  electronic  computer,  because  it  may  stupidly 
continue  the  process  indefinitely  unless  this  possibility  is  anticipated 
and  prepared  for. 

A  rather  comprehensive  general  discussion  of  numerical  iteration  has 
been  presented  by  Urabe  [3],  and  the  results  presented  in  the  following 
three  sections  of  this  paper  can  be  regarded  as  concrete  examples  of  his 
theory. 

FUNDAMENTAL  NOTIONS 

A digital number is defined as an integral multiple of a certain
quantity $\epsilon$ ($\epsilon$ being, e.g., $10^{-10}$ for ten-decimal-place arithmetic) which may be
called the quantum of the arithmetic. Only digital numbers can properly
be stored and handled by a digital computer. See, e.g., Ref. [4], Chapter
1. A digital number is represented notationally by a bar, as $\bar{x}$.

A theoretical iterative process

$x_{n+1} = \phi(x_n)$,   $n = 0, 1, 2, \ldots$        (7)

has as its numerical counterpart

$\bar{x}_{n+1} \doteq \phi(\bar{x}_n)$,   $n = 0, 1, 2, \ldots$        (8)

which determines, for any given digital number $\bar{x}_0$ as the starting value,
a sequence of digital numbers. If the sequence does not include the
starting value $\bar{x}_0$ again, then $\bar{x}_0$ is called transient. If the sequence does
include $\bar{x}_0$ again, then it is called recurrent. In particular, if all values
in the sequence are equal to $\bar{x}_0$, then $\bar{x}_0$ is stationary. A recurrent digital
number which is not stationary belongs to a recurrent cycle of length
$\ge 2$. Some of this terminology comes from the theory of Markov chains
(e.g., Ref. [5], Chapter 15).

In  Chung's  terminology  [6]  "transient"  and  "recurrent"  are  replaced 
by  "inessential"  and  "essential,"  respectively,  and  only  "essential 
non-null"  states  are  called  "recurrent."  In  our  case,  however,  all  es- 
sential states  are  non-null. 

We  say  the  process  is  in  the  state  of  numerical  convergence  [3]  if  a 
recurrent  point  has  been  reached. 
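
These definitions translate directly into a small classification routine. The sketch below applies them to the integer process (5) from the introduction; the names `step` and `classify` and the step limit are illustrative assumptions. It relies on the process being deterministic: once the orbit repeats a state without returning to the starting value, the starting value can never recur.

```python
from fractions import Fraction
from math import floor


def round_nearest(q):
    """Round an exact fraction to the nearest integer (halves go up)."""
    return floor(q + Fraction(1, 2))


def step(state):
    """Numerical process (5) on integer triples (u, v, w)."""
    u, v, w = state
    return (
        round_nearest(Fraction(5 - v - 2 * w, 7)),
        round_nearest(Fraction(3 - u - 3 * w, 8)),
        round_nearest(Fraction(6 - 2 * u - 3 * v, 9)),
    )


def classify(start, max_steps=10_000):
    """Classify a starting value as 'stationary', 'recurrent' or 'transient'."""
    state = step(start)
    if state == start:
        return "stationary"
    seen = set()
    for _ in range(max_steps):
        if state == start:
            return "recurrent"      # the orbit came back to the start
        if state in seen:
            return "transient"      # orbit entered a cycle that excludes start
        seen.add(state)
        state = step(state)
    raise RuntimeError("no repetition found within max_steps")


for start in [(0, 0, 0), (1, 0, 1), (0, 0, 1), (1, 0, 0), (0, 1, 0)]:
    print(start, classify(start))
```

The first two triples form the cycle of length 2 described earlier, the next two are the stationary states, and (0, 1, 0) is an example of a transient value.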


SUMMARY  OF  RESULTS  OBTAINED  PREVIOUSLY  FOR  A 
LINEAR  EQUATION  IN  ONE   UNKNOWN 

A  good  insight  into  the  behavior  of  numerical  iterative  processes  is 
obtained  from  investigating  closely  a  very  simple  example,  namely,  the 
iterative  solution  of  a  linear  algebraic  equation  for  one  unknown.  See 
Ref.  [1],  pp.  315-319. 

Let the process be

$x_{n+1} = a - b x_n$,   $n = 0, 1, 2, \ldots$        (9)

where $a$ and $b$ are given constants. The process converges
("theoretically") if and only if $|b| < 1$. The corresponding numerical process is

$\bar{x}_{n+1} \doteq a - b\bar{x}_n$,   $n = 0, 1, 2, \ldots$        (10)

where we assume that $\doteq$ denotes equality after a simple truncation.¹

For a given $a$, $b$, we can uniquely determine an integer $X$ and a number
$A$, such that

$a = [(1 + b)X + A]\epsilon$        (11)

$0 \le A < 1 + b$        (12)

where $\epsilon$ is the "quantum" of the arithmetic. Now we change the variable
from $\bar{x}$ to $u$ by the relation

$\bar{x} = (X + u)\epsilon$        (13)


¹Truncation, as distinguished from rounding to the nearest digital number
(or, in terms of $u$, rounding to the nearest integer), does not materially affect
the situation; in terms of Figure 1, it amounts to no more than a shift of the
vertical axis by one-half.

In terms of $u$, the numerical process (10) becomes

$u_{n+1} \doteq A - b u_n$,   $n = 0, 1, 2, \ldots$

Thus we have reduced the problem to the case where the quantum of the
arithmetic is equal to unity and the constant term $A$ is restricted to the
interval from 0 to $(1 + b)$.

The result of the investigation (Ref. [1], Theorem 3, p. 319) is
summarized in Figure 1.

Figure 1.  Results for $\bar{x}_{n+1} \doteq a - b\bar{x}_n$. Legend: [ ] = stationary points; ( ⇄ ) = cycles.

If the point $(A, b)$ falls inside the arrowtail-shaped area $A_0$, then the
only recurrent point is 0. If the point lies in the arrowhead area $B_1$, then
there are two recurrent points, -1 and 1, constituting a cycle of length
2. Besides these, the point 0 is also recurrent and in fact stationary;
however, this point will not be reached from outside, i.e., it is obtained
only by starting at 0. Similar considerations apply to areas $B_2, B_3, \ldots$. If
the point lies in $C_1, C_2, C_3, \ldots$, then the terminal state will be an
oscillation between 0 and 1; again the recurrent cycle is of length 2.

If the point falls in the inverted-kite-shaped area $D_1$, then two
stationary points exist, -1 and 0. The first is reached when approaching
from the left, the second when approaching from the right. If the point is
in $D_2$, then -2 and 0 are the two stationary points usually obtained. The
intermediate value, -1, is also a stationary point; however, it will never
be reached unless one is at -1 from the beginning. Similar results are
obtained for $D_3, D_4, \ldots$

The  areas  A0,  Bk,  Ck,  Dk  (k  =  1,2,3,...)  make  up  the  entire  triangular 
region  of  interest  (Equation  (12)). 

Of particular interest is the fact that, however small $b$ may be, i.e.,
however close to the horizontal axis the point $(A, b)$ may lie, there is
always a possibility of having two recurrent points, either constituting a
recurrent cycle of length 2 (0 ⇄ 1 in area $C_1$) or constituting a pair of
adjacent stationary points (0 and -1 in area $D_1$). The existence of a
cycle is important in programming for electronic computers, as we have
already indicated. The existence of more than one stationary point is
also important, because it means that mere convergence does not tell us
the smallness of the error.

After acknowledging the possibilities of more than one recurrent point,
we can be happy to observe from Figure 1 that, as long as we restrict
the coefficient $b$ to the interval $-1/2 < b < 1/2$, the only contingencies
are $A_0$ (with 0 as the unique stationary point), $C_1$ (with 0 and 1
constituting a cycle), and $D_1$ (with -1 and 0 as a pair of stationary points).
We had better bear this conclusion in mind whenever we can control the
value of $b$. See, for example, the following section.


GENERALIZATIONS  TO  SEVERAL   UNKNOWNS 
AND  NONLINEAR  EQUATIONS 

In  the  second  paper  [2],  the  result  discussed  in  the  preceding  section 
was  generalized  to  the  case  of  several  unknowns  and  to  the  case  of 
nonlinear  equations. 

Let $x = [x_i]$ and $a = [a_i]$ be $p$-vectors and let $b = [b_{ij}]$ be a $(p \times p)$ matrix.
The iterative process

$x^{(n+1)} = a - b x^{(n)}$,   $n = 0, 1, 2, \ldots$        (14)

has as its numerical counterpart

$\bar{x}^{(n+1)} \doteq a - b\bar{x}^{(n)}$,   $n = 0, 1, 2, \ldots$        (15)

where the bar on the $x$ now indicates that $\bar{x}$ is a vector whose components
are digital numbers. Let us assume this time that the rounding rule is to
the nearest digital number.

The condition

$\sum_{j=1}^{p} |b_{ij}| \le r < 1$, for all $i = 1, \ldots, p$        (16)

is sufficient for the process (14) to converge to

$\xi = (1 + b)^{-1} a$        (17)

Let us define, for any $p$-vector $x$,

$\|x\| = \max_{1 \le i \le p} |x_i|$        (18)

We have a theorem (Ref. [2], Theorem 3.1) stating that, under
condition (16),

$\|\bar{x} - \xi\| \le \frac{\epsilon}{2(1 - r)}$        (19)

for any recurrent point $\bar{x}$. Here, as before, $\epsilon$ is the quantum of the
arithmetic.

As a corollary, we can prove that, under condition (16), for any two
recurrent points $\bar{x}_1$ and $\bar{x}_2$,

$\|\bar{x}_1 - \bar{x}_2\| \le \frac{\epsilon}{1 - r}$        (20)

This inequality gives an upper bound to the amplitude of the oscillation
which may occur in the state of numerical convergence. In particular, if
$r < 1/2$, then Equation (20) implies $\|\bar{x}_1 - \bar{x}_2\| < 2\epsilon$, from which

$\|\bar{x}_1 - \bar{x}_2\| \le \epsilon$        (21)

because both $\bar{x}_1$ and $\bar{x}_2$ are digital vectors and consequently any
component of $\bar{x}_1 - \bar{x}_2$ must be a multiple of $\epsilon$. Thus, if Equation (16) is
satisfied with a value of $r < 1/2$, then all recurrent points are contained
in a cube with edges equal to $\epsilon$, the quantum of the arithmetic. The cube
contains the true root either within itself or on its boundary, as is
implied by Equation (19).

The above theorem holds also for the Gauss-Seidel iteration where we
use in each step the most recently obtained values of the components;
e.g., we use $\bar{x}_1^{(n+1)}$ rather than $\bar{x}_1^{(n)}$ in computing $\bar{x}_2^{(n+1)}$.

The generalization to the case

$x^{(n+1)} = \phi(x^{(n)})$,   $n = 0, 1, 2, \ldots$        (22)

in place of the linear relation (14) is straightforward. The condition

$\sum_{j=1}^{p} \left| \frac{\partial \phi_i}{\partial x_j} \right| \le r < 1$, for all $i = 1, \ldots, p$        (23)

takes the place of condition (16). This condition (together with certain
regularity conditions) will ensure that both Equations (19) and (20) hold.

As an application of these results, consider the "predictor-corrector"
method of solving ordinary differential equations. Take, for example,

$\frac{dy}{dx} = f(x, y)$        (24)

for one unknown function $y$ and one variable $x$, and suppose that we use
the trapezoidal rule. Then, at the $k$th step, having obtained $y_0, y_1, \ldots, y_k$
for $x = x_0, x_1, \ldots, x_k$, we compute $y_{k+1}$ for $x = x_{k+1} = x_k + h$ by the
following iterative process:

$y_{k+1}^{(0)} = y_{k-1} + 2h f(x_k, y_k)$        (25)

$y_{k+1}^{(n+1)} = y_k + \frac{h}{2}\left\{ f(x_k, y_k) + f(x_{k+1}, y_{k+1}^{(n)}) \right\}$,   $n = 0, 1, 2, \ldots$        (26)

Here we are interested in the "corrector," Equation (26). The
corresponding numerical process is

$\bar{y}_{k+1}^{(n+1)} \doteq \bar{y}_k + \frac{h}{2}\left\{ f(x_k, \bar{y}_k) + f(x_{k+1}, \bar{y}_{k+1}^{(n)}) \right\}$,   $n = 0, 1, 2, \ldots$        (27)

where the bar upon $y$ indicates that it is a digital number. Now we can
apply the above result and know that, if

$\frac{h}{2}\left| \frac{\partial f}{\partial y} \right| \le r < 1$        (28)

in the region of interest, then Equations (19) and (20) hold. If, moreover,
$r < 1/2$, then there are at most two recurrent points, adjacent to each
other.

Usually  the  interval  h  of  one  step  is  chosen  so  that  the  left-hand  side 
of  Equation  (28)  will  be  sufficiently  small.  I  believe  it  to  be  good  prac- 
tice to  make  it  less  than  1/2. 
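
A small numerical experiment illustrates the corrector iteration (27) under rounding to the nearest digital number. The test equation dy/dx = -y, the step h = 0.1, the quantum 10^-5, and the starting guess 0.90000 are all assumptions chosen for this sketch (so that the left-hand side of (28) equals 0.05); y values are stored as integers in units of the quantum. The loop only looks for a stationary point and does not attempt to detect cycles.

```python
from fractions import Fraction
from math import floor


def round_nearest(q):
    """Round an exact fraction to the nearest integer (halves go up)."""
    return floor(q + Fraction(1, 2))


def corrector(y_k, y_start, h, max_iter=50):
    """Iterate (27) for the assumed f(x, y) = -y until the value repeats."""
    y = y_start
    trace = [y]
    for _ in range(max_iter):
        # y_k + (h/2) * (f(x_k, y_k) + f(x_{k+1}, y)) with f(x, y) = -y
        y_next = round_nearest(y_k + (h / 2) * (-y_k - y))
        trace.append(y_next)
        if y_next == y:   # stationary point reached
            break
        y = y_next
    return trace


# y_k = 1.00000 and a hypothetical starting guess 0.90000, in units of 1e-5.
trace = corrector(y_k=100000, y_start=90000, h=Fraction(1, 10))
print(trace)
```

The unrounded corrector has the fixed point 95000/1.05 ≈ 90476.19 in these units, and the rounded iteration settles on the neighboring digital number 90476.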

Similar discussions apply in the case of Milne's method (Ref. [7],
Method VII), and in the case of simultaneous differential equations;
see Ref. [2] for more details.

RANDOM  ROUNDING 

Now  let  us  return  to  the  simple  problem  of  a  linear  equation  in  one 
unknown,  discussed  previously.  The  numerical  process  again  is 

$\bar{x}_{n+1} \doteq a - b\bar{x}_n$,   $n = 0, 1, 2, \ldots$        (29)

where  a,  b  are  given  constants,  \b\  <  1,  and  the  bar  upon  x  indicates  that 
it  is  a  digital  number.  This  time,  however,  we  assume  that  the  rounding 
rule  is  not  deterministic  but  random. 
For the sake of simplicity, let us assume that the "quantum" of the
arithmetic is 1, and the value of $a$ has been standardized to be in the
region (cf. Equation (12))

$0 \le a < 1 + b$        (30)

The rounding rule we use here is illustrated as follows: Suppose
$a = 0.3$, $b = 0.2$, and suppose $\bar{x}_n = 0$. Then $a - b\bar{x}_n = 0.3 - 0.2(0) = 0.3$.
In this case, we take

$\bar{x}_{n+1} = 1$ with probability 0.3
$\bar{x}_{n+1} = 0$ with probability 0.7        (31)

In order to perform this random rounding, it is convenient to do the
following operations:

1.  Choose a random number $\zeta$ with uniform distribution over the
interval (0, 1).

2.  Add $\zeta$ to the result of the computation (in our case 0.3).

3.  Chop off the fractional part and get an integer. $\zeta$ will be $< 0.7$ with
probability 0.7, and $\ge 0.7$ with probability 0.3, thus fulfilling the
prescription of Equation (31).

To illustrate how this rounding rule will affect the behavior of the
numerical process (Equation (29)), let us perform an experiment with
$a = 0.3$, $b = 0.2$. Several tables are available as the source of random
numbers $\zeta_0, \zeta_1, \zeta_2, \ldots$ to be used in the process. The results shown in
Table 2 use the table in the Appendix of Ref. [8].

TABLE 2
Result of a Random Experiment*

  n     x_n     a - b x_n     zeta_n     a - b x_n + zeta_n
  0     100      -19.7         0.73       -18.97
  1     -19        4.1         0.08         4.18
  2       4       -0.5         0.14        -0.36
  3      -1        0.5         0.31         0.81
  4       0        0.3         0.83         1.13
  5       1        0.1         0.06         0.16
  6       0        0.3         0.91         1.21
  7       1        0.1         0.02         0.12
  8       0        0.3         0.39         0.69
  9       0        0.3         0.23         0.53
 10       0        0.3         0.87         1.17
 11       1        0.1         0.54         0.64
 12       0        0.3         0.91         1.21
 13       1        0.1         0.30         0.40
 14       0        0.3         0.32         0.62
 15       0        0.3         0.39         0.69

*a = 0.3;  b = 0.2;  zeta is uniformly distributed over (0, 1).
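
The random rounding rule and the experiment of Table 2 can be replayed in a few lines. The sketch below uses exact fractions and the tabulated random numbers; the names `random_round` and `replay` are illustrative. "Chopping off the fractional part" is implemented as rounding down to the next lower integer, an interpretation inferred from the table itself (for example, -18.97 is followed by -19).

```python
import random
from fractions import Fraction
from math import floor

A, B = Fraction("0.3"), Fraction("0.2")

# Random numbers for n = 0, ..., 15, copied from Table 2.
TABLE_2_RANDOM = ["0.73", "0.08", "0.14", "0.31", "0.83", "0.06", "0.91", "0.02",
                  "0.39", "0.23", "0.87", "0.54", "0.91", "0.30", "0.32", "0.39"]


def random_round(value, u=None):
    """Add a uniform (0, 1) number to value, then drop the fractional part."""
    if u is None:
        u = Fraction(random.random())   # fresh random number when none is given
    return floor(value + u)


def replay(x0, randoms):
    """Run x_{n+1} = random_round(a - b x_n) with the given random numbers."""
    xs = [x0]
    for u in randoms:
        xs.append(random_round(A - B * xs[-1], Fraction(u)))
    return xs


xs = replay(100, TABLE_2_RANDOM)
print(xs)
```

The first sixteen values printed are the column of iterates in Table 2; the last entry is the iterate that would follow n = 15.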
We observe that, after a few steps, the process begins to oscillate (in
a somewhat irregular fashion) between two values 0 and 1. In fact, in
this example, as soon as $\bar{x}$ becomes either 0 or 1, it will never get out
of these two values. In this final stage, the process behaves as
indicated in Figure 2. The transition probabilities are:

             to
  from      0       1
    0      0.7     0.3
    1      0.9     0.1

Figure 2.  Final stage oscillation.

The process is "positively regular" in the sense of Fréchet [9]. The
(unique) stable probability distribution is given by $(u_0, u_1)$, which
satisfies

$u_0 = 0.7u_0 + 0.9u_1$
$u_1 = 0.3u_0 + 0.1u_1$        (32)
$u_0 + u_1 = 1$

The solution is

$u_0 = 3/4$,   $u_1 = 1/4$        (33)

The expected value of $\bar{x}_n$ will approach, as $n \to \infty$, the value

$0 \times u_0 + 1 \times u_1 = 1/4$        (34)

which happens to coincide with the true root $\xi$ of the equation $\xi = a - b\xi$.
Notice that $\xi = a/(1 + b) = 0.3/(1 + 0.2) = 1/4$ in the above example.

This coincidence is not due to mere chance. In order to see that this
is necessarily so, let us express the numerical process (29) in the more
explicit form:

$\bar{x}_{n+1} = a - b\bar{x}_n + \eta_n$,   $n = 0, 1, 2, \ldots$        (35)

The rounding error $\eta_n$ in each step is, in the present case, a random
number with the property

$E(\eta_n) = 0$        (36)

Consequently, we have

$E(\bar{x}_{n+1}) = a - bE(\bar{x}_n)$,   $n = 0, 1, 2, \ldots$        (37)

from which we can derive (under the assumption that $|b| < 1$) the
conclusion that

$\lim_{n \to \infty} E(\bar{x}_n) = \xi$        (38)
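
The passage from the recursion for the expected value to its limit can be spelled out in one line. In the sketch below, $e_n$ denotes the deviation of the expected value from the root; this symbol is introduced here only for the derivation, which uses nothing beyond the root equation $\xi = a - b\xi$.

```latex
\begin{aligned}
e_n &:= E(\bar{x}_n) - \xi, \\
e_{n+1} &= \bigl(a - b\,E(\bar{x}_n)\bigr) - \bigl(a - b\,\xi\bigr) = -b\,e_n, \\
e_n &= (-b)^n e_0 \longrightarrow 0 \quad (n \to \infty), \qquad \text{since } |b| < 1 .
\end{aligned}
```

The deviation therefore shrinks geometrically with ratio $|b|$, whatever the starting value.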


This  is  a  happy  conclusion  and  may  have  some  practical  significance 
for  high-speed  computers.  We  shall  not  go  any  further  into  this  here,  but 
rather  will  concentrate  our  attention  on  the  behavior  of  the  process  at 
the  final  stage  —  the  state  of  numerical  convergence.  The  result  is  pre- 
sented in  Figure  3  and  will  be  proved  in  the  next  section. 


Figure  3.    Results  for  xn+1  =  a  —  bxn  when  using  random  rounding. 
Comparing Figure 3 with Figure 1, we find that the arrowtail-shaped
area $A_0$ in Figure 1 is now divided into two parts. The larger part
absorbed all the areas $D_1, D_2, \ldots$ to constitute a big parallelogram. If the
point $(a, b)$ falls in this parallelogram, then the two points 0 and 1 are
the only recurrent points and the process will oscillate between these
values in much the same fashion as in the above illustrative example. If
the point falls in the smaller part of former region $A_0$ (the triangular area
with one side on the vertical axis), then there are three recurrent points,
-1, 0, and 1, and a unique stable probability distribution over them
exists.

Likewise, the former area $B_1$ is now split into two parts, one of which
has four recurrent points, -1, 0, 1, and 2; the other, five recurrent points,
-2, -1, 0, 1, and 2. The case is similar for the other $B_k$'s.

The former area $C_1$ is now split into two parts, in one of which we
have three recurrent points, 0, 1, and 2, and in the other, four recurrent
points, -1, 0, 1, and 2, and so on.

In  each  of  these  areas,  the  recurrent  points  as  a  whole  constitute  a 
single  "ergodic  set"  without  cyclic  subdivision. 

The  possibility  of  two  or  more  stationary  points  side  by  side  has 
completely  disappeared.  This  fact  is  quite  interesting.  It  implies  that 
the  amount  of  the  error  (from  the  true  root)  will  always  "show  up"  in  the 
final  erratic  oscillation.  This  is  certainly  a  good  feature. 


PROOF  OF  THE  RANDOM  ROUNDING  RESULTS 

In this section, we use a formula of the form of Equation (35) to
indicate the numerical process in question. However, we will not use the
bar upon $x$ to indicate that it is a digital number (an integer in the
present discussion). Thus, we have

$x_{n+1} = a - b x_n + \eta_n$,   $n = 0, 1, 2, \ldots$        (39)

where $x_n$ $(n = 0, 1, 2, \ldots)$ are integers. If, for a given $x_n$, $a - b x_n$ turns out
to be somewhere between two integers (Figure 4), then these two
integers will be denoted by $\underline{x}_{n+1}$ and $\bar{x}_{n+1}$ $(= \underline{x}_{n+1} + 1)$. We denote the
corresponding rounding errors by $\underline{\eta}_n$ and $\bar{\eta}_n$. The random variable $\eta_n$ in Equation
(39) takes on the value $\bar{\eta}_n$ with probability $-\underline{\eta}_n$ and the value $\underline{\eta}_n$ with
probability $\bar{\eta}_n$. If $a - b x_n$ turns out to be an integer, then we take both
$\underline{x}_{n+1}$ and $\bar{x}_{n+1}$ to be that integer and put $\bar{\eta}_n = \underline{\eta}_n = 0$. In this case the
random variable $\eta_n$ takes the value 0 with probability 1. Thus $\bar{\eta}_n$ and $\underline{\eta}_n$
satisfy the inequalities:

$0 \le \bar{\eta}_n < 1$
$-1 < \underline{\eta}_n \le 0$        (40)

Figure 4.  Geometry.

Let us discuss first:

Case 1:  $-1 < b < 0$,  $0 < a < 1 + b$.

Lemma 1:  In Case 1, $1 \le x_n$ implies $0 \le \underline{x}_{n+1} < x_n$, and $x_n \le 0$ implies
$x_n < \bar{x}_{n+1} \le 1$.

Proof:  Suppose $x_n \ge 1$. Then

$\underline{x}_{n+1} = a - b x_n + \underline{\eta}_n < (1 + b) - b x_n + 0 = x_n - (1 + b)(x_n - 1) \le x_n$        (41)

$\underline{x}_{n+1} > 0 - b x_n - 1 \ge -1$, which implies $\underline{x}_{n+1} \ge 0$        (42)

Notice that $\underline{x}_{n+1}$ is an integer.
Next suppose $x_n \le 0$. Then

$\bar{x}_{n+1} = a - b x_n + \bar{\eta}_n > 0 - b x_n + 0 = x_n - (1 + b)x_n \ge x_n$        (43)

$\bar{x}_{n+1} < (1 + b) - b x_n + 1 \le 2 + b < 2$        (44)

which implies that $\bar{x}_{n+1} \le 1$.

This lemma implies that, if the point $x_n$ is to the right of 1, then there
is a positive probability of $x_{n+1}$ going strictly to the left of $x_n$ (but
never beyond 0). Likewise, if $x_n$ is to the left of 0, then there is a
positive probability of $x_{n+1}$ going strictly to the right of $x_n$ (but never beyond
1). For $x_n = 0$, $\bar{x}_{n+1} = 1$ and $\underline{x}_{n+1} = 0$; and for $x_n = 1$, $\bar{x}_{n+1} = 1$ and
$\underline{x}_{n+1} = 0$. Thus, starting from any integer value, either positive or
negative, there is a positive probability of the process going to 0 and 1 in
finite steps. On the other hand, once the process gets at 0 or 1, there is
no chance of its getting out of those two values. Hence 0 and 1 are the
only recurrent points. Since all the transition probabilities from 0 or 1 to
0 or 1 are positive, this is "positively regular" in the sense of
Fréchet [9].

Next, let us consider:

Case 2:  $0 < b < 1$,  $0 < a < 1$.

Lemma 2:  In Case 2, $x_n \ge 1$ implies $-(x_n - 1) \le \bar{x}_{n+1} \le 1$, and $x_n \le 0$
implies $0 \le \underline{x}_{n+1} \le (-x_n)$.

Proof:  Suppose $x_n \ge 1$. Then

$\bar{x}_{n+1} = a - b x_n + \bar{\eta}_n > 0 - b x_n + 0 = -x_n + (1 - b)x_n > -x_n$        (45)

which implies $\bar{x}_{n+1} \ge -(x_n - 1)$.

$\bar{x}_{n+1} < 1 - b x_n + 1 < 2$, which implies that $\bar{x}_{n+1} \le 1$        (46)

Next, suppose $x_n \le 0$. Then

$\underline{x}_{n+1} = a - b x_n + \underline{\eta}_n < 1 - b x_n + 0 = (-x_n) + (1 - b)x_n + 1 \le (-x_n) + 1$        (47)

which implies that $\underline{x}_{n+1} \le (-x_n)$.

$\underline{x}_{n+1} > 0 - b x_n - 1 \ge -1$, which implies that $\underline{x}_{n+1} \ge 0$        (48)

This lemma implies that, with positive probability, the process will
reduce the magnitude at least by 1 in two iterations until it reaches 0.
Thus any integer will lead to 0 with positive probability (any integer
$\rightsquigarrow 0$, in Chung's symbol [6]).

It is easy to see that $0 \rightsquigarrow 1$. But how far can we go back to other
points? In order to answer this question, we need:

Lemma 3:  Suppose $k$ is a positive integer. Then $x_n = k$ implies
$\underline{x}_{n+1} \ge -k$, with equality holding if and only if $1 - a > k(1 - b)$; and
$x_n = -k$ implies $\bar{x}_{n+1} \le k + 1$, with equality holding if and only if
$a > k(1 - b)$.

Proof:  Suppose $x_n = k > 0$. Then

$\underline{x}_{n+1} = a - bk + \underline{\eta}_n > 0 - k - 1$, which implies $\underline{x}_{n+1} \ge -k$        (49)

and $\underline{x}_{n+1} = -k$ if and only if $a - bk < -k + 1$, which is equivalent to
$1 - a > k(1 - b)$.

Next, suppose $x_n = -k < 0$. Then

$\bar{x}_{n+1} = a + bk + \bar{\eta}_n < 1 + k + 1 = k + 2$        (50)

which implies that $\bar{x}_{n+1} \le k + 1$; and $\bar{x}_{n+1} = k + 1$ if and only if $a + bk > k$,
which is equivalent to $a > k(1 - b)$.
This lemma implies that, starting from 1, there is a positive
probability of going to -1, then to +2, then to -2, then to +3, and so on, as
long as the corresponding inequalities

$1 - a > 1 - b$,   $a > 1 - b$,
$1 - a > 2(1 - b)$,   $a > 2(1 - b)$,  etc.,        (51)

are satisfied. As soon as either $1 - a > k(1 - b)$ or $a > k(1 - b)$ ceases to
be satisfied, the process will fail to lead to any further points. Only
those points that are covered in this process of "going backward" are
recurrent, because only those are reachable from 0, which is reached
from any point. These considerations serve as the proof of the result
shown in the unit square $0 < a < 1$, $0 < b < 1$ of Figure 3; notice that the
two pencils of straight lines in the diagram correspond to $1 - a = k(1 - b)$
and $a = k(1 - b)$, where $k = 1, 2, \ldots$.

It is not difficult to prove that, if $-k \sim k$ or $-(k - 1) \sim k$ are the only
recurrent points, then with positive probability any recurrent point will
lead to any other recurrent point in exactly $4k$ steps: we go to 0 as
fast as possible, stay there for a suitable time, and come back to the
destination as fast as possible. This means that those recurrent points
define the "positively regular" Markov chain, so that we have a unique
stable probability over those points; see Ref. [9].

Example:  $a = 0.6$, $b = 0.7$.

Figure 3 shows that in this case $-1 \sim 2$ are the only recurrent points.
The transition probabilities between these points can be easily computed
to be:

             to
  from     -1       0       1       2
   -1       0       0      0.7     0.3
    0       0      0.4     0.6      0
    1      0.1     0.9      0       0
    2      0.8     0.2      0       0

The (unique) stable probability distribution is:

$u_{-1} = 10/204$,   $u_0 = 115/204$,   $u_1 = 76/204$,   $u_2 = 3/204$

Its mean is

$\frac{-10 + 0 + 76 + 6}{204} = \frac{72}{204} = \frac{0.6}{1 + 0.7}$   $(= \xi)$

Similarly, we could treat:

Case 3:  $0 < b < 1$,  $1 \le a < 1 + b$.

But actually this is not necessary, because the symmetry of the
situation with respect to the vertical axis of Figure 3 and the translation of
the x axis by 1 will take care of this region perfectly. Thus the triangular
area labeled $0 \sim 2$ comes from the area labeled $-1 \sim 1$ by two
transformations; $-1 \sim 2$ from another $-1 \sim 2$; $-1 \sim 3$ from $-2 \sim 2$; and so on.

This completes the proof of the result in the previous section.


CONCLUSION 

The discussions of "further results" in the preceding two sections
show the close relationship between the theory of stochastic approximation
and the theory of numerical convergence with random rounding.
Exploitation of whatever practical merits this idea may have is left to
the future. At this moment, it will serve at least as another interesting
example of the application of the theory of Markov chains.


HERBERT ROBBINS


Some  Numerical 
Results  on  a  Compound 
Decision  Problem 


When statistical decision problems of the same formal structure occur
in large groups, rather than individually, there is often the possibility of
reducing the total risk in the "compound" problem by using a decision
procedure in which the decision in each individual problem is based not
only on the random variable associated with that problem but also on the
random variables associated with the other problems in the group. The
advantages of this point of view have been set forth on several occasions
in the past [1-3]. Here we present a very special case of the general
procedure, chosen so as to reduce to a minimum the mathematical apparatus
required to obtain precise numerical results.

Imagine, then, that we are confronted with the following situation.
There are two states of nature, represented by the symbol $\theta$, which can
assume only the values 0 and 1. We do not know the true value of $\theta$, but
we can observe a random variable $x$, also having only two values, 0 and 1,
and such that

   if $\theta = 0$ then $x = 0$ with certainty,

   if $\theta = 1$ then $x = 1$ with probability $p$ and $x = 0$ with probability $q = 1 - p$,

where $0 < p < 1$ is a known parameter. After observing $x$ we are required
to guess whether $\theta = 0$ or 1. If we guess correctly we lose nothing, while

   if $\theta = 1$ and we guess that $\theta = 0$ we lose an amount $a$,
   if $\theta = 0$ and we guess that $\theta = 1$ we lose an amount $b$,

where $a$ and $b$ are known positive constants. Clearly, if we observe $x = 1$,
we should guess that $\theta = 1$. The problem is what to do when $x = 0$.
Accordingly, we introduce the two decision procedures:

   $D_0$: if $x = 0$ guess that $\theta = 0$, if $x = 1$ guess that $\theta = 1$.
   $D_1$: guess that $\theta = 1$ irrespective of the value of $x$.

The expected losses due to incorrect decision in using $D_0$ or $D_1$ are
given by

   $R(D_0, 0)$ = expected loss in using $D_0$ when $\theta = 0$ = 0

   $R(D_1, 1)$ = expected loss in using $D_1$ when $\theta = 1$ = 0

   $R(D_1, 0)$ = expected loss in using $D_1$ when $\theta = 0$ = $b$

   $R(D_0, 1)$ = expected loss in using $D_0$ when $\theta = 1$ = $aq$

More generally, suppose we consider using a so-called randomized decision
procedure $D_\delta$ defined as follows. We perform an auxiliary random
experiment having nothing to do with $\theta$ or $x$ and with two outcomes, 0 and
1, such that $P(1) = \delta$ and $P(0) = 1 - \delta$. We then use $D_0$ if the outcome is
0 and $D_1$ if the outcome is 1. Here $\delta$ can be any constant, $0 \le \delta \le 1$.
When $\delta = 0$, $D_\delta = D_0$, and when $\delta = 1$, $D_\delta = D_1$, consistent with the
previous notation. We now have at our disposal a family of decision
procedures $D_\delta$, $0 \le \delta \le 1$, with expected losses given by

   $R(D_\delta, 0)$ = expected loss in using $D_\delta$ when $\theta = 0$ = $b\delta$
   $R(D_\delta, 1)$ = expected loss in using $D_\delta$ when $\theta = 1$ = $aq(1 - \delta)$

Our problem is now how to choose $\delta$.

Suppose that the problem of guessing the value of $\theta$ from observation of
$x$ is presented to us a total of $n$ times:

$$\begin{cases} \theta_1, \theta_2, \ldots, \theta_n \\ x_1, x_2, \ldots, x_n \end{cases}$$

where each $\theta_i$ is either 0 or 1 and the probability distribution of $x_i$
depends on $\theta_i$ as before. We are now required to guess for each $i$ whether
$\theta_i$ is 0 or 1, and we assume the same loss for incorrect decision as
before. If we use the decision procedure $D_\delta$ for each of the $n$ decisions our
average expected loss over the $n$ decisions will be

$$\left(1 - \frac{m}{n}\right) R(D_\delta, 0) + \frac{m}{n} R(D_\delta, 1)$$

where

$$m = \sum_{i=1}^{n} \theta_i = \text{number of 1's in the sequence } \theta_1, \ldots, \theta_n$$

Accordingly, we define for any $0 \le \delta, \xi \le 1$ the function

$$f(\delta, \xi) = (1 - \xi) R(D_\delta, 0) + \xi R(D_\delta, 1) = (1 - \xi) b\delta + \xi aq(1 - \delta)$$

and can now state the following: If we use the decision procedure $D_\delta$ for
each of the $n$ decisions, and if $m/n = \xi$ where $m$ is the number of 1's
among $\theta_1, \ldots, \theta_n$, then the average expected loss over the $n$ decisions will
be $f(\delta, \xi)$.

We observe that for any fixed $0 \le \delta \le 1$ the equation $y = f(\delta, \xi)$ defines
in the $\xi y$ plane a straight line through the point $(\xi_0, y_0)$ whose
coordinates are

$$\xi_0 = \frac{b}{aq + b}, \qquad y_0 = \frac{abq}{aq + b}$$

and of slope $aq - (aq + b)\delta$. In fact we can write the equation $y = f(\delta, \xi)$
in the form

$$y - y_0 = [aq - (aq + b)\delta](\xi - \xi_0) \tag{1}$$

In particular, for $\delta = \delta_0 = aq/(aq + b)$, the line becomes the horizontal line
$y = y_0$.
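
As a numerical check of Equation (1), the following minimal Python sketch evaluates the average expected loss for several values of the randomization probability and confirms that every line passes through the common point. The function names are hypothetical, and the constants a = 30, b = 1, q = 0.3 are only illustrative choices for this sketch.

```python
def avg_loss(delta, xi, a, b, q):
    """Average expected loss f(delta, xi) = (1 - xi)*b*delta + xi*a*q*(1 - delta)."""
    return (1.0 - xi) * b * delta + xi * a * q * (1.0 - delta)


def pivot(a, b, q):
    """Point (xi0, y0) through which every line y = f(delta, xi) passes."""
    return b / (a * q + b), a * b * q / (a * q + b)


# Illustrative constants (an assumption of this sketch, not a requirement).
a, b, q = 30.0, 1.0, 0.3
xi0, y0 = pivot(a, b, q)
delta0 = a * q / (a * q + b)      # the delta that makes the line horizontal

# f(delta, xi0) should equal y0 for every delta in [0, 1].
losses_at_pivot = [avg_loss(k / 10.0, xi0, a, b, q) for k in range(11)]

# With delta = delta0 the loss should not depend on xi at all.
losses_at_delta0 = [avg_loss(delta0, k / 10.0, a, b, q) for k in range(11)]
```

Both lists should contain the same constant value up to rounding, which is the geometric content of the pencil of lines described above.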

Suppose now that $\xi = m/n$ is known to us. Then it is clear from
Equation (1) that the choice of $\delta$ should be made as follows:

   if $\xi < \xi_0$ use $D_0$

   if $\xi > \xi_0$ use $D_1$

   if $\xi = \xi_0$ use any $D_\delta$, $0 \le \delta \le 1$

By doing so we incur the minimum possible average expected loss, given
by the function

$$R(\xi) = \min_{0 \le \delta \le 1} f(\delta, \xi) = \begin{cases} aq\xi & \text{for } 0 \le \xi \le \xi_0 \\ b(1 - \xi) & \text{for } \xi_0 \le \xi \le 1 \end{cases}$$

In order to attain this minimum it is not necessary to know the precise
value of $\xi$ but only whether it is greater than or less than $\xi_0$. If nothing
is known about $\xi$ we may wish to use $D_{\delta_0}$ and incur the constant average
expected loss $y_0$, which has the "minimax" property that for every
$0 \le \delta \le 1$,

$$\max[b\delta, aq(1 - \delta)] = \max_{\xi} f(\delta, \xi) \ge \max_{\xi} f(\delta_0, \xi) = y_0$$

Note that there is no decision procedure which, when we are completely
ignorant of the value of $\xi$, can assure us for every $\xi$ of the minimum
average expected loss $R(\xi)$.
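
The minimax property can be checked numerically. The sketch below assumes that the smallest loss attainable with known proportion is the pointwise minimum of the average expected loss over the randomization probability, which by linearity equals min(aqξ, b(1 − ξ)). The function names and the constants a = 30, b = 1, q = 0.3 are illustrative assumptions.

```python
def worst_case_loss(delta, a, b, q):
    """Max over xi in [0, 1] of f(delta, xi). Since f is linear in xi, the
    maximum is at xi = 0 (value b*delta) or at xi = 1 (value a*q*(1 - delta))."""
    return max(b * delta, a * q * (1.0 - delta))


def envelope(xi, a, b, q):
    """Smallest loss attainable when xi is known: min(a*q*xi, b*(1 - xi))."""
    return min(a * q * xi, b * (1.0 - xi))


a, b, q = 30.0, 1.0, 0.3
grid = [k / 100.0 for k in range(101)]

# Grid search for the delta with the smallest worst-case loss.
best_delta = min(grid, key=lambda d: worst_case_loss(d, a, b, q))
minimax_value = worst_case_loss(best_delta, a, b, q)

# The envelope peaks at xi0 = b / (a*q + b), where it equals the minimax value.
peak_xi = max(grid, key=lambda x: envelope(x, a, b, q))
```

For these constants the search should land on the probability aq/(aq + b) = 0.9, and the peak of the envelope on b/(aq + b) = 0.1.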

The results obtained so far, however, strongly suggest the following
procedure in the case where all the values $x_1, \ldots, x_n$ are at hand before
the individual decisions have to be made:

1. From $x_1, \ldots, x_n$ estimate $\xi = m/n$.

2. Then use $D_0$ or $D_1$ according as $\hat{\xi} < \xi_0$ or $\hat{\xi} > \xi_0$, where $\hat{\xi}$ is an
estimate of $\xi$ based on $x_1, \ldots, x_n$.

We shall now show how such a procedure can be constructed and evaluated.
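As to the estimation of $\xi = m/n$, we observe that

$$E(x_i) = \begin{cases} 0 & \text{if } \theta_i = 0 \\ p & \text{if } \theta_i = 1 \end{cases}$$

so that, setting $s = x_1 + \ldots + x_n$, we have

$$E(s) = mp, \qquad m = \sum_{i=1}^{n} \theta_i = \text{number of 1's among } \theta_1, \ldots, \theta_n$$

Hence, setting

$$\hat{\xi} = \frac{s}{np}$$

we have

$$E(\hat{\xi}) = \xi, \qquad \operatorname{Var} \hat{\xi} = \frac{mpq}{n^2 p^2} \le \frac{q}{p} \cdot \frac{1}{n}$$

uniformly for all $\theta_1, \ldots, \theta_n$.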
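
The unbiasedness of the estimate s/(np) and its variance bound q/(pn) can be illustrated by simulation. The following is a minimal sketch, not part of the original argument; the function names, the seed, and the setup (n = 1000 problems, of which m = 100 have parameter value 1, with p = 0.7) are assumptions chosen for illustration.

```python
import random


def observe(thetas, p, rng):
    """Draw x_i: always 0 when theta_i = 0, and 1 with probability p when theta_i = 1."""
    return [1 if (theta == 1 and rng.random() < p) else 0 for theta in thetas]


def estimate_xi(xs, p):
    """Estimate s / (n p) of xi = m / n, where s = x_1 + ... + x_n."""
    return sum(xs) / (len(xs) * p)


# Illustrative setup (assumed values).
rng = random.Random(0)                 # fixed seed so the run is repeatable
n, m, p = 1000, 100, 0.7
thetas = [1] * m + [0] * (n - m)

estimates = [estimate_xi(observe(thetas, p, rng), p) for _ in range(500)]
mean_estimate = sum(estimates) / len(estimates)
sample_variance = sum((e - mean_estimate) ** 2 for e in estimates) / (len(estimates) - 1)
variance_bound = (1.0 - p) / (p * n)   # the bound q / (p n)
```

The mean of the estimates should sit close to m/n = 0.1, and the sample variance should fall well below the bound, since the exact variance mpq/(n²p²) is smaller than q/(pn) by the factor m/n.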

Using $D_0$ or $D_1$ according as $\hat{\xi} < \xi_0$ or $\hat{\xi} > \xi_0$ is more or less
equivalent to the following rule $D^*$. Let

$$r = \left[ np\,\frac{b}{aq + b} \right]$$

where $[x]$ = greatest integer contained in $x$. Define

$$D^*: \begin{cases} \text{if } s < r & \text{use } D_0 \\ \text{if } s \ge r & \text{use } D_1 \end{cases}$$

We shall now evaluate the average expected loss involved in using this
new procedure $D^*$. Note that $D^*$ is a randomization between $D_0$ and $D_1$,
but with the randomization probability being a function of $\theta_1, \ldots, \theta_n$.

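If $\theta_1 + \ldots + \theta_n = m$ and $x_1 + \ldots + x_n = k$, then the total loss incurred in
the $n$ decisions is

$$\begin{cases} a(m - k) & \text{for } 0 \le k \le r - 1 \\ b(n - m) & \text{for } r \le k \le m \end{cases}$$

Hence the average expected loss using $D^*$ when $\theta_1 + \ldots + \theta_n = m$ is

$$L(m, n) = \frac{1}{n} \sum_{k=0}^{r-1} a(m - k) \binom{m}{k} p^k q^{m-k} + b\left(1 - \frac{m}{n}\right) \sum_{k=r}^{m} \binom{m}{k} p^k q^{m-k}$$

$$= aq\,\frac{m}{n}\,[1 - E(m - 1, r, p)] + b\left(1 - \frac{m}{n}\right) E(m, r, p)$$

where we have set

$$E(m, r, p) = \sum_{k=r}^{m} \binom{m}{k} p^k q^{m-k}$$

a function which has been extensively tabulated. Note that

$$E(m, r, p) = 1 - E(m, m - r + 1, q) = E(m - 1, r, p) + \binom{m-1}{r-1} p^r q^{m-r}$$

Thus we can write

$$L(m, n) = aq\,\frac{m}{n}\,E(m - 1, m - r, q) + b\left(1 - \frac{m}{n}\right)[1 - E(m, m - r + 1, q)]$$

$$= aq\,\frac{m}{n}\left[1 + \binom{m-1}{r-1} p^r q^{m-r}\right] + \left[b - (aq + b)\frac{m}{n}\right] E(m, r, p) \tag{2}$$

We would expect the greatest deviation between $L(m, n)$ and $R(m/n)$ to
occur when $m/n$ is in the neighborhood of $\xi_0 = b/(aq + b)$. Thus, setting
$m = m_0 = n\xi_0$ and $r = np\xi_0$ (assuming for simplicity that $n\xi_0$ and $np\xi_0$ are
integers), we obtain from Equation (2) that

$$L(m_0, n) = aq\,\xi_0 \left[1 + \binom{m_0 - 1}{r - 1} p^r q^{m_0 - r}\right]$$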

It is not difficult to prove the following

Theorem. Given $\epsilon > 0$, there exists an $n(\epsilon)$ such that $n > n(\epsilon)$ implies
that

$$L(m, n) < R\left(\frac{m}{n}\right) + \epsilon$$

uniformly for $m = 0, 1, \ldots, n$.

In other words, as $n \to \infty$, the procedure $D^*$ asymptotically attains the
average expected loss $R(m/n)$ uniformly for all $2^n$ vectors $(\theta_1, \ldots, \theta_n)$.
Thus by using $D^*$ we can, if $n$ is large, do almost as well as if we knew
$\xi = m/n$ and used the rule $D_0$ or $D_1$ according as $\xi < \xi_0$ or $\xi > \xi_0$.

Here are some numerical results in a special case. Let $a = 30$, $b = 1$,
$p = 0.7$, and $n = 1{,}000$. Then $q = 0.3$, $\xi_0 = 0.1$, $y_0 = 0.9$, $r = 70$,

$$R(\xi) = \begin{cases} 9\xi & \text{for } 0 \le \xi \le 0.1 \\ 1 - \xi & \text{for } 0.1 \le \xi \le 1 \end{cases}$$

and from Equation (2)

$$L(m, n) = L(m, 1{,}000) = \frac{9m}{1{,}000}\,E(m - 1, m - 70, 0.3) + \left(1 - \frac{m}{1{,}000}\right)[1 - E(m, m - 69, 0.3)]$$

Rounding off to the third decimal place we have from tables of $E(m, r, p)$
the results shown in Table 1.

TABLE 1

     m     R(m/1,000)    L(m, 1,000)    L(m, 1,000) - R(m/1,000)

     0       0.000          0.000               0.000
     .         .              .                   .
     .         .              .                   .
     .         .              .                   .
    70       0.630          0.630               0.000
    80       0.720          0.720               0.000
    90       0.810          0.832               0.022
    92       0.828          0.864               0.036
    94       0.846          0.896               0.050
    96       0.864          0.924               0.060
    98       0.882          0.944               0.062
    99       0.891          0.951               0.060
   100       0.900          0.955               0.055
   101       0.899          0.956               0.057
   102       0.898          0.956               0.058
   104       0.896          0.946               0.053
   106       0.894          0.938               0.044
   108       0.892          0.926               0.034
   110       0.890          0.914               0.024
   120       0.880          0.882               0.002
   130       0.870          0.870               0.000
     .         .              .                   .
     .         .              .                   .
     .         .              .                   .
 1,000       0.000          0.000               0.000
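
Where printed tables of the binomial tail are not at hand, entries of this kind can be recomputed directly. The following Python sketch is an illustration under stated assumptions: it uses the form L(m, n) = aq(m/n)[1 − E(m − 1, r, p)] + b(1 − m/n)E(m, r, p), takes E(m, r, p) to be the upper binomial tail, and sums the tail in log space to avoid underflow. All function names are hypothetical.

```python
import math


def binom_tail(m, r, p):
    """E(m, r, p) = P(Binomial(m, p) >= r), summed in log space."""
    if r <= 0:
        return 1.0
    if r > m:
        return 0.0
    total = 0.0
    for k in range(r, m + 1):
        log_term = (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
                    + k * math.log(p) + (m - k) * math.log(1.0 - p))
        total += math.exp(log_term)
    return min(total, 1.0)


def envelope(xi, a, b, q):
    """R(xi) = min(a q xi, b (1 - xi)), the loss attainable when xi is known."""
    return min(a * q * xi, b * (1.0 - xi))


def compound_loss(m, n, r, a, b, p):
    """L(m, n) = a q (m/n) [1 - E(m-1, r, p)] + b (1 - m/n) E(m, r, p)."""
    q = 1.0 - p
    return (a * q * m / n * (1.0 - binom_tail(m - 1, r, p))
            + b * (1.0 - m / n) * binom_tail(m, r, p))


# Constants of the numerical example: a = 30, b = 1, p = 0.7, n = 1000, r = 70.
a, b, p, n, r = 30.0, 1.0, 0.7, 1000, 70
for m in (0, 70, 90, 100, 110, 130, 1000):
    r_val = envelope(m / n, a, b, 1.0 - p)
    l_val = compound_loss(m, n, r, a, b, p)
    print(f"{m:5d}  {r_val:.3f}  {l_val:.3f}  {l_val - r_val:.3f}")
```

The printed rows can be compared against the corresponding rows of Table 1; small differences in the last decimal place are to be expected from rounding in the tabulated values.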

The results obtained in the preceding discussion can be generalized in
several directions. For example, we may test two arbitrary "simple"
hypotheses about the probability distribution of a random variable $x$.
Moreover, the decision on the $i$th parameter value $\theta_i$ can be made to
depend only on $x_1, \ldots, x_i$. The extension to these cases is described in the
joint paper with Miss Ester Samuel, which follows.


Testing Statistical Hypotheses—
The "Compound" Approach


The  usual  approach  in  testing  a  simple  hypothesis  against  a  simple 
alternative  is  based  on  the  assumption  that  the  problem  must  be  faced 
only  once.  However,  it  often  happens  that  the  same  problem  occurs  not 
once  but  many  times.  Advantages  might  then  be  gained  by  considering 
the whole collection of problems as a totality. This paper summarizes,
without  proofs,  recent  work  in  statistical  decision  theory  which  shows 
that  this  is  indeed  the  case.  Details  and  proofs  can  be  found  in  Refs. 
[1-4]. 

The basic decision problem which we consider is the following. There
are two possible "states of nature," symbolized by a parameter $\theta$ that
can assume only the values 0 and 1. The value of $\theta$ is unknown, but an
observable random variable $x$ with values in some space $X$ has a probability
distribution $P_\theta$ which depends on $\theta$ in a known way. We are required,
on the basis of a single observation of $x$, to decide whether $\theta$ is
0 or 1. If in fact $\theta = 1$ but we wrongly guess that $\theta = 0$ we incur a penalty
or loss of $a$ units, and if $\theta = 0$ and we guess that $\theta = 1$ we lose $b$ units,
where $a$ and $b$ are given positive numbers. A (randomized) decision
function is a function $t = t(x)$ defined for all $x$ in $X$ and such that
$0 \le t(x) \le 1$; the corresponding decision rule is to guess "$\theta = 1$" and
"$\theta = 0$" with respective probabilities $t(x)$ and $1 - t(x)$ when $x$ is observed,
the randomization being accomplished by a table of random
numbers or some other extraneous device. Our problem is how to assign
the values $t(x)$ in such a way as to minimize in some sense the loss due
to incorrect decision.


*Research sponsored by the Office of Naval Research, under Contract No.
Nonr-266(59), Project No. NR 042-205 and Contract No. Nonr-266(33), Project
No. NR 042-034.

This problem has been dealt with so extensively in the statistical
literature, notably by Neyman and Pearson, Wald, and the Bayesians, that
it might seem that nothing new remains to be said. We shall summarize
the usual approach.

Associated with each decision function $t$ are the two values of the
"risk function" $R(t, \theta)$ defined as the expected loss in using $t$ when
$\theta$ (= 0 or 1) is the true parameter value:

$$R(t, \theta) = \begin{cases} b E_0[t(x)] & \text{for } \theta = 0 \\ a[1 - E_1(t(x))] & \text{for } \theta = 1 \end{cases} \tag{1}$$

where $E_\theta$ denotes expectation with respect to the probability distribution
$P_\theta$. The smaller the values of $R(t, 0)$ and $R(t, 1)$ the better the decision
function $t$. For any real number $\xi$ such that $0 \le \xi \le 1$, we define the weighted
average of the two expected losses

$$R(t, \xi) = \xi R(t, 1) + (1 - \xi) R(t, 0) \tag{2}$$

Any decision function $t$ that, for a fixed $\xi$, minimizes the value of $R(t, \xi)$
is said to be Bayes with respect to $\xi$, and will be denoted by $t_\xi$.

The usual intuitive meaning associated with Equation (2) is the following.
Suppose that the statistician knows (or feels) that the unknown
parameter $\theta$ is in fact a random variable with specified a priori probabilities
$\xi$ and $1 - \xi$ of taking on the values 1 and 0, respectively. Then for
any decision function $t$ the "global" expected loss will be given by
Equation (2), and hence it will be reasonable to use the decision function
$t_\xi$ which minimizes $R(t, \xi)$.

To find $t_\xi$ explicitly, we may assume without loss of generality that
the two distributions $P_\theta$ of $x$ are given in terms of density functions
$f(x, \theta)$ with respect to a fixed measure $\mu$ on $X$, so that for any event $S$,

$$P_\theta(S) = \int_S f(x, \theta)\,d\mu \qquad (\theta = 0, 1) \tag{3}$$

Then from Equations (1) and (2) we have

$$R(t, \xi) = \xi a[1 - E_1(t(x))] + (1 - \xi) b E_0(t(x)) = \xi a + \int_X [(1 - \xi) b f(x, 0) - \xi a f(x, 1)]\,t(x)\,d\mu \tag{4}$$

Clearly, the last expression is minimized by minimizing the integrand
for each $x$ in $X$; that is, by defining $t(x)$ to be

$$t_\xi(x) = \begin{cases} 1 & \text{if } (1 - \xi) b f(x, 0) < \xi a f(x, 1) \\ 0 & \text{if } (1 - \xi) b f(x, 0) > \xi a f(x, 1) \\ \text{arbitrary in } [0, 1] & \text{if } (1 - \xi) b f(x, 0) = \xi a f(x, 1) \end{cases} \tag{5}$$

We may, for example, use the nonrandomized version of Equation (5)
given by

$$t^0_\xi(x) = \begin{cases} 1 & \text{if } (1 - \xi) b f(x, 0) < \xi a f(x, 1) \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

The function $R(\cdot)$ with values

$$R(\xi) = R(t_\xi, \xi) = \min_t R(t, \xi) \tag{7}$$

is called the Bayes envelope function, and is essential to what follows.
It represents the minimal global expected loss attainable by any decision
function when $\theta$ is a random variable with a priori distribution $P[\theta = 1] = \xi$,
$P[\theta = 0] = 1 - \xi$.
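
For a finite sample space the Bayes rule and the envelope function can be computed directly. The sketch below is a minimal illustration: the densities are given as lists indexed by x, the function names are hypothetical, and the two-point example (under θ = 0 the observation is surely 0; under θ = 1 it is 1 with probability 0.7) with losses a = 30 and b = 1 is an assumed test case.

```python
def bayes_rule(xi, a, b, f0, f1):
    """Nonrandomized Bayes rule: guess theta = 1 at x iff (1 - xi) b f(x,0) < xi a f(x,1)."""
    return [1 if (1.0 - xi) * b * f0[x] < xi * a * f1[x] else 0
            for x in range(len(f0))]


def risk(t, xi, a, b, f0, f1):
    """Weighted risk R(t, xi) = xi a [1 - E_1 t(x)] + (1 - xi) b E_0 t(x)."""
    e0 = sum(t[x] * f0[x] for x in range(len(f0)))
    e1 = sum(t[x] * f1[x] for x in range(len(f1)))
    return xi * a * (1.0 - e1) + (1.0 - xi) * b * e0


def bayes_envelope(xi, a, b, f0, f1):
    """R(xi): the risk of the Bayes rule for xi, evaluated at xi itself."""
    return risk(bayes_rule(xi, a, b, f0, f1), xi, a, b, f0, f1)


# Assumed two-point example with X = {0, 1}.
a, b = 30.0, 1.0
f0 = [1.0, 0.0]      # density of x under theta = 0
f1 = [0.3, 0.7]      # density of x under theta = 1

rule_small_xi = bayes_rule(0.05, a, b, f0, f1)   # follows the observation
rule_large_xi = bayes_rule(0.5, a, b, f0, f1)    # always guesses theta = 1
```

Evaluating the risk of a rule tuned to one prior weight at a different weight gives a straight line in the second argument, and none of these lines dips below the envelope.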

In a $\xi y$ plane, the curve $y = R(\xi)$ is the envelope of the one-parameter
family of straight lines $y = R(t_\alpha, \xi)$ for $0 \le \alpha \le 1$. By Equation (7), for
each fixed $\alpha$, the curve $y = R(\xi)$ lies entirely below the straight line
$y = R(t_\alpha, \xi)$, and the curve $y = R(\xi)$ is concave and continuous, as shown
in Figure 1.

Figure 1. Bayes envelope function.

Intuitively speaking, $R(t_\alpha, \xi)$ may be regarded as the expected loss
incurred by a Bayesian who thinks that $P[\theta = 1] = \alpha$ and hence uses the
decision function $t_\alpha$, when in fact $P[\theta = 1] = \xi$; the excess of $R(t_\alpha, \xi)$
over $R(\xi)$ is the cost of his error in incorrectly estimating the true value
of the a priori probability $\xi = P[\theta = 1]$.

Particular interest attaches to the use of the decision function $t_{\alpha_M}$ for
which $R(t_{\alpha_M}, 0) = R(t_{\alpha_M}, 1) = \max_\xi R(\xi)$. This "minimax" rule has the
property that for any other rule $t$,

$$\max(R(t_{\alpha_M}, 0), R(t_{\alpha_M}, 1)) \le \max(R(t, 0), R(t, 1))$$

so that the maximum possible expected loss due to ignorance of the true
state of nature is minimized by using $t_{\alpha_M}$.

Suppose now that, instead of being confronted with a single decision
problem of the type just described, we are faced with a finite sequence
of $n$ decision problems of the same structure. Thus, let

$$\begin{array}{l} \theta_1, \ldots, \theta_n; \quad \theta_i = 0 \text{ or } 1 \\ x_1, \ldots, x_n \end{array} \tag{8}$$

be a finite sequence of parameter values and corresponding observable
random variables, where $x_i$ is distributed according to $P_{\theta_i}$ and is
independent of the other $x$'s and $\theta$'s. Again, the problem is to decide for each
$i = 1, \ldots, n$ whether $\theta_i = 0$ or 1.

The traditional solution of this "compound decision problem" is to
choose some decision function $t$ as described above and use it to decide
about each $\theta_i$ on the basis of $x_i$ alone. We shall call such a rule a
"simple symmetric rule." It makes no use of the fact that all $n$ decision
problems have the same structure. A fundamentally different approach
to the compound decision problem is taken in Refs. [1] and [2] in the
case in which all the values $x_1, \ldots, x_n$ are observed before the individual
decisions are to be made. In Ref. [3], this approach is generalized to
the "sequential" case in which the decision on $\theta_i$ has to be made on the
basis of $x_1, \ldots, x_i$ only. We shall sketch the main results of Refs. [1-3].

A decision rule for the compound case will be denoted by $T = (t_1, \ldots, t_n)$
with $0 \le t_i \le 1$. Here $t_i$, the probability of deciding that $\theta_i = 1$ when the
$x$'s are observed, now may be a function not only of $x_i$ but also of $x_1, \ldots, x_n$
in the nonsequential case, or of $x_1, \ldots, x_i$ in the sequential case. Thus
the decision on $\theta_i$ will in general depend on values $x_j$ with $j \ne i$. At first
sight this may seem unreasonable.

We shall take for the risk function in the compound case the function

$$R(T; \theta_1, \ldots, \theta_n) = \frac{1}{n} \sum_{i=1}^{n} R_i(T; \theta_1, \ldots, \theta_n) \tag{9}$$

where $R_i(T; \theta_1, \ldots, \theta_n)$ is the expected loss on the $i$th decision in using $T$
when the parameter values are $\theta_1, \ldots, \theta_n$; in symbols,

$$R_i(T; \theta_1, \ldots, \theta_n) = \begin{cases} b \int \cdots \int t_i\,dP_{\theta_1}(x_1) \cdots dP_{\theta_n}(x_n) & \text{when } \theta_i = 0 \\ a\left[1 - \int \cdots \int t_i\,dP_{\theta_1}(x_1) \cdots dP_{\theta_n}(x_n)\right] & \text{when } \theta_i = 1 \end{cases} \tag{10}$$

all the integrals being taken over $X$. In the special case of a simple
symmetric rule with $t_i = t(x_i)$ for all $i = 1, \ldots, n$, Equation (9) becomes

$$\frac{1}{n} \sum_{i=1}^{n} R(t, \theta_i) = \bar{\theta}_n R(t, 1) + (1 - \bar{\theta}_n) R(t, 0) = R(t, \bar{\theta}_n) \tag{11}$$

where $R(t, \cdot)$ is defined by Equation (2) and

$$\bar{\theta}_n = \frac{1}{n} \sum_{i=1}^{n} \theta_i = \text{proportion of 1's among } \theta_1, \ldots, \theta_n \tag{12}$$

This yields a new, "non-Bayesian" interpretation of Figure 1. We know
that $R(t, \bar{\theta}_n)$ is minimized with respect to $t$ by $t_{\bar{\theta}_n}$, in which case
Equation (11) becomes $R(\bar{\theta}_n)$. Thus, if we know in advance of the experiment
the proportion $\bar{\theta}_n$ of 1's in the sequence $\theta_1, \ldots, \theta_n$, we can, by using a
simple symmetric rule $t_{\bar{\theta}_n}$, obtain a risk equal to the Bayes envelope
function $R(\cdot)$ evaluated at the point $\bar{\theta}_n$, and no other simple symmetric
rule can do better. The trouble is, of course, that we usually will not
know $\bar{\theta}_n$ either before, during, or after the experiment. If we guess the
value $\bar{\theta}_n$ to be $\alpha$ and use $t_\alpha$ throughout when in fact $\bar{\theta}_n = \xi$, the risk will
be $R(t_\alpha, \xi)$, which is never smaller and may be much larger than $R(\xi)$; see
Figure 1.

A way out of this predicament in the nonsequential case is to go outside
the class of simple symmetric rules and to use all the observations
$x_1, \ldots, x_n$ to estimate $\bar{\theta}_n$ by an estimate $\eta_n$, and then use $t^0_{\eta_n}(x_i)$ to decide
on $\theta_i$. Thus, by Equation (6) we would set

$$t_i(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } (1 - \eta_n) b f(x_i, 0) < \eta_n a f(x_i, 1) \\ 0 & \text{otherwise} \end{cases} \tag{13}$$

One possible class of estimators $\eta_n$ is the following. Let $S$ be any
event for which $P_0(S) \ne P_1(S)$, and set

$$g(x) = \begin{cases} 1 & \text{if } x \in S \\ 0 & \text{if } x \notin S \end{cases} \tag{14}$$

and

$$h(x) = \frac{g(x) - P_0(S)}{P_1(S) - P_0(S)} \tag{15}$$

Then $h$ is a bounded, unbiased estimator of $\theta$:

$$E_\theta[h(x)] = \theta \quad \text{for } \theta = 0, 1 \tag{16}$$

Hence

$$\bar{h}_n = \frac{1}{n} \sum_{i=1}^{n} h(x_i) \tag{17}$$

is an unbiased estimate of $\bar{\theta}_n$, and

$$\eta_n = \begin{cases} 0 & \text{if } \bar{h}_n < 0 \\ \bar{h}_n & \text{if } 0 \le \bar{h}_n \le 1 \\ 1 & \text{if } \bar{h}_n > 1 \end{cases} \tag{18}$$

will be such that $0 \le \eta_n \le 1$ and for any $\delta > 0$,

$$\lim_{n \to \infty} P_{\theta_1, \ldots, \theta_n}\left[\,|\eta_n - \bar{\theta}_n| > \delta\,\right] = 0 \tag{19}$$

uniformly for all $2^n$ parameter vectors $(\theta_1, \ldots, \theta_n)$.

For any such estimator $\eta_n$ let $T_{\eta_n}$ denote the decision rule with
$t_i = t^0_{\eta_n}(x_i)$, $i = 1, \ldots, n$. We can now state:

Theorem 1. Given any $\epsilon > 0$ there exists an $N(\epsilon)$ such that for every
$n > N(\epsilon)$,

$$R(T_{\eta_n}; \theta_1, \ldots, \theta_n) < R(\bar{\theta}_n) + \epsilon \tag{20}$$

uniformly in $(\theta_1, \ldots, \theta_n)$.

The truth of the theorem (see Ref. [2]) follows essentially from
Equation (19) and the fact that when $|\xi - \alpha| < \delta(\epsilon)$, then $R(t_\alpha, \xi) < R(\xi) + \epsilon$.
See Figure 1.
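
The construction in Equations (13)-(18) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, densities and the event S are passed as callables, and the two-point example (S = {1}, so that the probability of S is 0 under θ = 0 and 0.7 under θ = 1, with losses a = 30 and b = 1) is assumed for concreteness.

```python
def eta_estimate(xs, in_S, p0_S, p1_S):
    """Clip h_bar, the mean of [g(x_i) - P0(S)] / [P1(S) - P0(S)], to [0, 1]."""
    frac_in_S = sum(1 for x in xs if in_S(x)) / len(xs)
    h_bar = (frac_in_S - p0_S) / (p1_S - p0_S)
    return min(1.0, max(0.0, h_bar))


def compound_decisions(xs, a, b, f0, f1, in_S, p0_S, p1_S):
    """Decide every theta_i with the Bayes rule for the estimated proportion eta_n."""
    eta = eta_estimate(xs, in_S, p0_S, p1_S)
    return [1 if (1.0 - eta) * b * f0(x) < eta * a * f1(x) else 0 for x in xs]


# Assumed two-point example.
a, b = 30.0, 1.0
f0 = lambda x: 1.0 if x == 0 else 0.0     # density of x under theta = 0
f1 = lambda x: 0.3 if x == 0 else 0.7     # density of x under theta = 1
in_S = lambda x: x == 1

few_ones = [1] * 3 + [0] * 97     # eta about 0.043: a zero is read as theta = 0
many_ones = [1] * 14 + [0] * 86   # eta about 0.2: every theta_i is guessed to be 1

decisions_few = compound_decisions(few_ones, a, b, f0, f1, in_S, 0.0, 0.7)
decisions_many = compound_decisions(many_ones, a, b, f0, f1, in_S, 0.0, 0.7)
```

The same observation x = 0 is thus decided differently in the two samples, which is exactly the dependence of each decision on the other observations that distinguishes the compound rule from a simple symmetric one.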

REMARKS 

1. Theorem 1 states that even without advance knowledge of $\bar{\theta}_n$ one
can, for sufficiently large $n$, do almost as well as the simple symmetric
rule $t_{\bar{\theta}_n}$ can do, and this is true uniformly for all values $\theta_1, \ldots, \theta_n$.

2. $T_{\eta_n}$ is, however, not "admissible" in the sense of Wald, since for
every $n$ there exists a rule $T'_n$ such that for all $\theta_1, \ldots, \theta_n$,

$$R(T'_n; \theta_1, \ldots, \theta_n) \le R(T_{\eta_n}; \theta_1, \ldots, \theta_n) \tag{21}$$

at least one of the $2^n$ inequalities being strict. An explicit description
of the class of all admissible rules has been given in Ref. [3], but these
generally require so much computation that for practical purposes the
rules $T_{\eta_n}$ seem preferable.

3. It is also shown in Ref. [3] that the minimax rule for the (sequential
and nonsequential) compound decision problem is just the simple
symmetric rule which at each decision uses the minimax rule $t_{\alpha_M}$ for
which $R(t_{\alpha_M}, 0) = R(t_{\alpha_M}, 1) = \max_\xi R(\xi)$.

4. Numerical values of the differences $R(T_{\eta_n}; \theta_1, \ldots, \theta_n) - R(\bar{\theta}_n)$ in
two specific cases can be found in Refs. [1] and [4].

Consider now the sequential case. Here we can use at the $i$th decision
the decision function

$$t_i(x_1, \ldots, x_i) = t^0_{\eta_{i-1}}(x_i) \tag{22}$$

where $\eta_0 = 1/2$ and $\eta_i$ for $i \ge 1$ is defined by Equation (18) with $n$
replaced by $i$. Calling this rule $T^*$, we have the following analog to
Theorem 1, the proof of which is given in Ref. [3].

Theorem 2. Assume that $P_0$ and $P_1$ are such that $dR(\xi)/d\xi$ exists for
all $0 < \xi < 1$. Then given any $\epsilon > 0$ there exists an $N(\epsilon)$ such that for
every $n > N(\epsilon)$,

$$R(T^*; \theta_1, \ldots, \theta_n) < R(\bar{\theta}_n) + \epsilon \tag{23}$$

uniformly in $\theta_1, \ldots, \theta_n$.

Remarks 1-3 under Theorem 1 apply to Theorem 2 as well, with $T_{\eta_n}$
replaced by $T^*$.
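
A sequential rule of this kind can be sketched as a single pass over the observations. The Python below is a hedged illustration: it assumes that the ith decision uses the clipped estimate built from the earlier observations only, starting from the value 1/2 before any data are seen, which is one reading of Equation (22). The function name and the two-point example (S = {1}, losses a = 30 and b = 1) are assumptions.

```python
def sequential_decisions(xs, a, b, f0, f1, in_S, p0_S, p1_S):
    """Decide theta_i from x_i, using an estimate built from x_1, ..., x_{i-1} only."""
    decisions = []
    hits = 0                                   # number of earlier x_j falling in S
    for i, x in enumerate(xs, start=1):
        if i == 1:
            eta = 0.5                          # assumed starting value before any data
        else:
            h_bar = (hits / (i - 1) - p0_S) / (p1_S - p0_S)
            eta = min(1.0, max(0.0, h_bar))    # clip the estimate to [0, 1]
        decisions.append(1 if (1.0 - eta) * b * f0(x) < eta * a * f1(x) else 0)
        if in_S(x):
            hits += 1
    return decisions


# Assumed two-point example.
a, b = 30.0, 1.0
f0 = lambda x: 1.0 if x == 0 else 0.0     # density of x under theta = 0
f1 = lambda x: 0.3 if x == 0 else 0.7     # density of x under theta = 1
in_S = lambda x: x == 1

all_zero_run = sequential_decisions([0] * 5, a, b, f0, f1, in_S, 0.0, 0.7)
ones_then_zero = sequential_decisions([1, 1, 0], a, b, f0, f1, in_S, 0.0, 0.7)
```

Because the estimate is updated after each decision, the rule can be run without fixing the number of problems in advance, and its decisions depend on the order in which the observations arrive.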

5. Note that we may use $T^*$ without knowing $n$ in advance. The risk
$R(T^*; \theta_1, \ldots, \theta_n)$ will, however, depend on the order of $\theta_1, \ldots, \theta_n$, and not
only on $\bar{\theta}_n$ as in the nonsequential case, which makes relation (23)
even more remarkable.

6. We do not know to what extent Theorem 2 remains true when $R(\xi)$
does not have a derivative (e.g., when $X$ is a finite set). In any case,
there is exhibited in Ref. [3] a randomized version of $T^*$ that satisfies
relation (23) for any two distributions $P_0$ and $P_1$.

Similar results in a more general decision problem have been announced
in an abstract [5].

7. In the rather exceptional case in which $\theta_1, \ldots, \theta_{i-1}$ themselves are
known at the time the $i$th decision is to be made, we can use the rule Tn
obtained from $T^*$ by replacing $t^0_{\eta_{i-1}}$ by $t^0_{\bar{\theta}_{i-1}}$ (setting $\bar{\theta}_0 = 1/2$).
Theorem 2 remains valid for Tn, as does remark 5.


CONCLUSION 

It is perhaps too early to assess the practical value of the decision
rules $T_{\eta_n}$, $T^*$, and Tn, and various more or less obvious modifications
of them. However, Theorems 1 and 2 provide strong support for the
"compound" approach when $n$ is large, and further work along these lines
should be of great value.


REFERENCES 

1. Robbins, H. "Asymptotically Subminimax Solutions of Compound
Statistical Decision Problems." Proceedings of the Second
Berkeley Symposium on Mathematical Statistics and Probability,
Berkeley: University of California Press, 1951.

2. Hannan, J. F., and H. Robbins. "Asymptotic Solutions of the Compound
Decision Problem," Ann. Math. Stat. 26, 37-51 (1955).

3. Samuel, E. On the Sequential and Non-Sequential Compound Decision
Problem. Dissertation (Columbia University, 1961).

4. Robbins, H. "Some Numerical Results on a Compound Decision
Problem," this volume, pp. 56-62.

5. Hannan, J. F. "The Dynamic Statistical Decision Problem When the
Component Problem Involves a Finite Number, m, of Distributions"
(abstract), Ann. Math. Stat. 27, 212 (1956).


H.  H.  GOODE* 


Deferred  Decision  Theory 


Some of the earliest notions concerning decision making between two
alternatives on the basis of observations of a stochastic variable were
introduced by Laplace [2]. In his view, for a variable representing an
outcome that is either a success or a failure, the quantity to be calculated
is

$$p = \frac{r + 1}{n + 1} \tag{1}$$

where $n$ is the number of trials and $r$ is the number of successful outcomes.
If $p$, now conventionally considered to be the probability of success,
is large, an action compatible with a successful outcome is indicated.
If $p$ is small, the converse action is indicated.

Subsequently, Bayes [3] constructed a model of decision which, for the
two-alternative case, sets up a choice to be made between two hypotheses.
The choice depends on values of a stochastic variable to be observed.
It is assumed that the a priori probability of each hypothesis' being true
is known. The distribution of the observed-variable values when either
hypothesis holds is also known. More explicitly, let the hypotheses be
$H_1$ and $H_2$ and their associated a priori probabilities $z$ and $1 - z$,
respectively; let the distribution of $y$, the observed variable, be $p_1(y)$ when $H_1$
is true and $p_2(y)$ when $H_2$ is true. Then

$$P_y(H_1) = \frac{z\,p_1(y)}{z\,p_1(y) + (1 - z)\,p_2(y)} \tag{2}$$

is Bayes' formula, where $P_y(H_1)$ denotes the probability that if $y$ was
observed, $H_1$ is true.
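
As a small worked illustration of Bayes' formula, the following Python sketch computes the posterior probability of the first hypothesis from the prior weight and the two likelihood values at the observed point. The function name and the numbers used are hypothetical.

```python
def posterior_h1(z, p1_y, p2_y):
    """P_y(H1) = z p1(y) / [z p1(y) + (1 - z) p2(y)]."""
    return z * p1_y / (z * p1_y + (1.0 - z) * p2_y)


# Equal prior weights, with the observation three times as likely under H2.
even_prior = posterior_h1(0.5, 0.2, 0.6)     # 0.1 / 0.4

# A strong prior for H1 outweighs the same evidence.
strong_prior = posterior_h1(0.9, 0.2, 0.6)   # 0.18 / 0.24
```

The two calls show how the same observation leads to different conclusions depending on the a priori probability assigned to the hypothesis.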

*Professor Goode completed this paper in rough form before his untimely death
in October 1960. The work was completed by T. G. Birdsall, Mrs. P. Elliot,
R. A. Roberts, and W. Evans, all of the University of Michigan. The mathematical
formulation of the problem is that of Blackwell and Girshick (Ref. [1], Section
10.2). This work was supported by U. S. Army Signal Corps Contract No.
DA-36-039 sc-78283 and the paper completed under Office of Naval Research
Contract No. Nonr-1224(36).

In the context of decision making, this formula incorporates the following
parameters:

1. The difference in value between alternatives, if these can be
expressed quantitatively.

2. Prior knowledge about the hypotheses.

3. The distribution of outcomes under each hypothesis.

Another factor, which is important in making decisions and is not accounted
for by the formula, is the set of losses and gains associated with the
several possible combinations of hypotheses and actions taken; for
example, an action taken compatible with $H_2$ when $H_1$ is true.

Because  of  widespread  misunderstanding  about  the  meaning  and  use  of 
the  a  priori  probabilities,  Bayes'  formula  fell  into  disuse.  Its  place  was 
taken  by  a  considerable  amount  of  confused  thinking,  represented,  for 
example,  by  the  view  that  the  population  mean  is  distributed  about  the 
sample  mean  instead  of  the  converse. 

Into this state of affairs, Fisher [4,5] injected precise statements concerning
inferences to be drawn from observations. He defined the null
hypothesis as one of interest whose validity was to be tested. He rejected
the null hypothesis when sets of observations occurred that would
be improbable if it were true. The level of improbability at which the
null was to be rejected was labeled the level of significance. However,
the alternative hypothesis was not in evidence.

Neyman and Pearson [6,7] reintroduced the alternative hypothesis.
They stated, reasonably, that if one alternative is rejected, another is
accepted. This might be a complex of alternatives all considered at
once. This, in turn, led to the recognition of two types of error: accepting
$H_2$ when $H_1$ is true (Type I), and accepting $H_1$ when $H_2$ is true (Type II).
Because the probability of each type of error could be calculated if the
decision policy relative to values observed were stated, it was possible
to make, in some sense, an optimum decision. Neyman and Pearson suggested
holding $\alpha$, the Type I error probability, constant and so choosing
the decision mechanism that $\beta$, the probability of a Type II error, would
be minimized.

For example, consider the situation represented in Figure 1, where
$p_1(y)$ and $p_2(y)$ are both normal with different means, and a single
observation is made. If one decides to accept $H_2$ whenever $y$ lies in the
interval $y_1 < y < y_2$, then

$$\alpha = \int_{y_1}^{y_2} p_1(y)\,dy$$

and

$$1 - \beta = \int_{y_1}^{y_2} p_2(y)\,dy$$

These areas are indicated for two different decision intervals in Figure 1.
When the observation falls in a region $\alpha$, which is fixed in area, $H_2$ will
be accepted; otherwise $H_1$ will be accepted. The only restriction on the
choice of $\alpha$ is its fixed area. Minimizing $\beta$ is equivalent to maximizing
$1 - \beta$. It is intuitively clear that the $\alpha$ region in the right tail of $p_1(y)$
maximizes $1 - \beta$ in $p_2(y)$. It is also rigorously correct. Thus, Neyman
and Pearson had restored the alternatives in the two-alternative decision
but had made no use of prior knowledge.


Figure 1. Two choices of the $\alpha$ region and the corresponding $1 - \beta$ regions for
two normal distributions.
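The statement that a right-tail region gives the largest $1 - \beta$ for a fixed $\alpha$ can be checked numerically. The sketch below is an illustration only: it assumes unit-variance normal densities with means 0 and 1 (values not given in the text), and it compares a right-tail region with an interior interval of the same area $\alpha = 0.05$. The function names are ours.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def alpha_and_power(y1, y2, mean1=0.0, mean2=1.0):
    """Return (alpha, 1 - beta) when H2 is accepted for y1 < y < y2.

    Both densities are assumed normal with unit variance (an assumption
    made for this illustration).
    """
    alpha = phi(y2 - mean1) - phi(y1 - mean1)   # integral of p1 over the interval
    power = phi(y2 - mean2) - phi(y1 - mean2)   # integral of p2 over the interval
    return alpha, power

# Two acceptance regions for H2 with the same area alpha = 0.05 under H1:
right_tail = alpha_and_power(1.6448536, math.inf)   # y > 1.645
interior = alpha_and_power(0.0, 0.1256613)          # 0 < y < 0.126

print("right tail: alpha=%.4f power=%.4f" % right_tail)
print("interior:   alpha=%.4f power=%.4f" % interior)
```

Under these assumed values the right-tail region gives the larger power, in line with the argument above.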


Wald [8] took a major step in his treatment of statistical decision functions.
He introduced three new characteristics of the decision and unified
the entire treatment. These three new characteristics are:

1. The introduction of the costs of errors as determinants of the sizes
of $\alpha$ and $\beta$. It remains true that there are two types of error, but
this fact is less important than the losses due to each.

2. The inclusion of the alternative of putting off decisions until it
pays to make a decision. Up to this point, the decision was assumed
to be made once and for all after observations were taken. Wald introduced
the clearly desirable notion of waiting until making a decision
is worthwhile.

3. The introduction of the possibility that an opponent is determining
the situation. This led to the minimax criterion that Von Neumann
[9] had introduced earlier, and the result was shown to be equivalent
to the game solution for zero-sum two-person games. Wald
used the expected value as a criterion for action.

In the present paper, we are concerned with the case of a two-alternative
decision in which it is allowable to defer decision (deferred decision
case) until further observations have been made. The problem is that of
determining the course of actions which will minimize expected losses.
The major objectives of the paper are twofold: to set forth the solution
of this problem as simply as possible so as to exhibit its implications
for practical decision making, and to provide a means of determining
numerical answers for special cases. To our knowledge, this has not
been done previously.

We begin with a short review of the two-alternative decision under the
approach used in present-day statistical decision theory. Then the problem
is restated in deferred decision terms. An iterative algorithm is
produced for the calculation of decision points if, in fact, they exist.
The existence of decision points is proved by proving convergence for
the algorithm.

Since the formulas are difficult to handle analytically, a computer program
has been written. The flow diagram for the program and a summary
of the results for Gaussian distributions and some specific numerical
costs are given. Finally, some conclusions are drawn regarding the
usefulness of the results.


STATISTICAL DECISION - TWO-ALTERNATIVE

Suppose we have two possible alternatives, $H_1$ and $H_2$, which we know
from prior knowledge have a probability of materializing of $z$ and $1 - z$,
respectively. To make decisions, we observe a variable $y$ whose probability
distribution, if $H_i$ is true, is $p_i(y)$, with $i = 1, 2$. We also know the
costs of making each of the possible errors. The cost is $\omega_{12}$ for taking
action $A_2$ consistent with $H_2$ when in fact $H_1$ is true; it is $\omega_{21}$ for taking
action $A_1$ consistent with $H_1$ when in fact $H_2$ is true. The gains from
making correct decisions have been normalized at zero with no loss of
generality. Our problem is to choose to take $A_1$ or $A_2$ on the basis of
an observed $y$ in such a fashion that the expected loss is minimized.

Our  choices  may,  of  course,  be  made  under  several  different  criteria. 
We  shall  deal  with  the  criterion  of  minimizing  the  expected  loss  as  being 
mathematically  tractable,  of  frequent  occurrence,  and  reasonable  in  many 
cases. 

After the observation of a value of $y$, the state of our knowledge of the
truth of $H_1$ will be given by Equation (2). It states the newly calculated
probability that $H_1$ is true, $P_y(H_1)$.

The probability (from our view) that $H_1$ is true is $P_y(H_1)$, and the
expected loss for taking $A_2$ is $P_y(H_1)\,\omega_{12}$. The expected loss for taking
action $A_1$ is $P_y(H_2)\,\omega_{21} = [1 - P_y(H_1)]\,\omega_{21}$. These lines are plotted in
Figure 2. The double-ruled portions of these curves are the smaller loss
parts, and if we are forced to be on one of these curves, we should try
to be on the double-ruled portions. This can be accomplished by taking
$A_2$ when $P_y(H_1) < \gamma$ and $A_1$ otherwise. $\gamma$ is given by

$$\gamma = \frac{\omega_{21}}{\omega_{12} + \omega_{21}} \tag{3}$$

Figure 2. Expected loss for deciding on $A_1$ and $A_2$.
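The two-alternative rule can be sketched in a few lines. This is a minimal illustration, assuming unit-variance normal densities with means 0 and 1 and the hypothetical costs shown; the function names and numerical values are ours and do not come from the paper.

```python
import math

def normal_pdf(y, mean):
    """Unit-variance normal density (assumed form of p1 and p2)."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def posterior_h1(z, y, mean1=0.0, mean2=1.0):
    """Bayes' formula, Equation (2): probability of H1 after observing y."""
    a = z * normal_pdf(y, mean1)
    b = (1.0 - z) * normal_pdf(y, mean2)
    return a / (a + b)

def choose_action(post, w12, w21):
    """Take A2 when the posterior of H1 is below gamma = w21 / (w12 + w21)."""
    gamma = w21 / (w12 + w21)           # Equation (3)
    loss_a2 = post * w12                # A2 is wrong when H1 is true
    loss_a1 = (1.0 - post) * w21        # A1 is wrong when H2 is true
    action = "A2" if post < gamma else "A1"
    return action, min(loss_a1, loss_a2), gamma

# Hypothetical numbers: prior z = 0.5, observation y = 2.0, costs 10 and 5.
post = posterior_h1(0.5, 2.0)
action, loss, gamma = choose_action(post, w12=10.0, w21=5.0)
print(action, round(post, 4), round(loss, 4), round(gamma, 4))
```

Choosing by the cutoff $\gamma$ and choosing the smaller of the two expected losses are the same rule; the code returns both so that the agreement can be inspected.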


Some notes should be made about this result and its relation to other
terminology. Frequently, the likelihood ratio, $\ell(y) \triangleq p_2(y)/p_1(y)$, is of
interest. $P_y(H_1)$ is monotonic, decreasing with $\ell(y)$, as shown by
substitution into Equation (2):

$$P_y(H_1) = \frac{z}{z + (1 - z)\,\ell(y)} \tag{4}$$

or

$$\ell(y) = \frac{z\,[1 - P_y(H_1)]}{(1 - z)\,P_y(H_1)} \tag{5}$$

$\ell(y)$ goes from $\infty$ to 0 as $P_y(H_1)$ varies from 0 to 1. The decision point
in $\ell(y)$ is

$$\bar{\ell} = \frac{z\,\omega_{12}}{(1 - z)\,\omega_{21}} \tag{6}$$

i.e., take $A_2$ when $\ell(y) > \bar{\ell}$.
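The step from the cutoff in $P_y(H_1)$ to the cutoff in the likelihood ratio is a single substitution. The short derivation below is added for convenience; it uses only Equations (3) and (5), and $\bar{\ell}$ is our label for the decision point of Equation (6).

```latex
% Put P_y(H_1) = \gamma in Equation (5), with \gamma from Equation (3):
\bar{\ell} \;=\; \frac{z\,(1-\gamma)}{(1-z)\,\gamma},
\qquad
\frac{1-\gamma}{\gamma}
  \;=\; \frac{\omega_{12}/(\omega_{12}+\omega_{21})}{\omega_{21}/(\omega_{12}+\omega_{21})}
  \;=\; \frac{\omega_{12}}{\omega_{21}}
\quad\Longrightarrow\quad
\bar{\ell} \;=\; \frac{z\,\omega_{12}}{(1-z)\,\omega_{21}} .
% Because P_y(H_1) decreases as \ell(y) increases,
% P_y(H_1) < \gamma  is equivalent to  \ell(y) > \bar{\ell}.
```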

Further, one frequently deals with four values: the two costs already
stated plus the gains associated with taking $A_2$ when in fact $H_2$ is true,
and the same for $A_1$. Let these gains be $\omega_{22}$ and $\omega_{11}$. Then, if one
calculates expected value taking these into account, $\omega_{12}/\omega_{21}$ is replaced
by $(\omega_{12} + \omega_{11})/(\omega_{21} + \omega_{22})$. Since the gains are negative costs, choosing $\omega_{11}$ and
$\omega_{22}$ as zero is merely the choice of new origins from which to measure
costs. We still retain freedom to choose a unit of measure for cost statements.
We shall reserve its use until later.

Finally, we note that $y$ can be representative of a larger number of
observations than just one. In case several observations are made,
$p_1(y)$ and $p_2(y)$ must be calculable. For example, for independence and
a single distribution, $p_i(y) = p_i(y_1)\,p_i(y_2) \cdots p_i(y_k)$, where $k$ observations
are made. It is emphasized that the state of our knowledge before seeing
$y$ is $z$ and $1 - z$, whereas after seeing $y$, it is $P_y(H_1)$ and $P_y(H_2)$. We
make the decision after seeing $y$. If we had to make it before seeing $y$,
the same cutoff point, $P_y(H_1) < \gamma$, would hold, but $P_y(H_1)$ would be replaced
by $z$.

We shall now examine the following: The situation is the same as
in the two-alternative case, except that we are told after arriving at the
state of knowledge $P_y(H_1)$ that we do not have to make a decision; we
can hold off for another observation or for as many as we like until a total
of $n$ have been taken. The penalty for taking observations is the cost
of an observation, which we take as 1 (the unit of loss measure). Whether
we should delay decision will depend on whether the expected loss for
taking another observation is greater than for making the decision now.
But, of course, the expected value of taking another observation will depend
on the expected value of taking another one beyond that, etc., until
we have exhausted our right to take observations. In fact, the expected
value of taking one more will depend on how many we may be allowed,
the actual value being the result of a complicated nesting process.
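Deferring works because the state of knowledge can be carried forward: for independent observations, applying Bayes' formula once per observation gives the same probability as a single application with the product densities. The sketch below checks this for hypothetical data, assuming unit-variance normal densities with means 0 and 1; the names and numbers are ours.

```python
import math

def normal_pdf(y, mean):
    """Unit-variance normal density (assumed form of p1 and p2)."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def update(z, y, mean1=0.0, mean2=1.0):
    """One application of Bayes' formula: prior z -> posterior Py(H1)."""
    a = z * normal_pdf(y, mean1)
    b = (1.0 - z) * normal_pdf(y, mean2)
    return a / (a + b)

def batch_posterior(z, ys, mean1=0.0, mean2=1.0):
    """Posterior using the joint densities p_i(y_1) p_i(y_2) ... p_i(y_k)."""
    joint1 = math.prod(normal_pdf(y, mean1) for y in ys)
    joint2 = math.prod(normal_pdf(y, mean2) for y in ys)
    return z * joint1 / (z * joint1 + (1.0 - z) * joint2)

observations = [0.3, 1.4, -0.2]    # hypothetical data
z = 0.5                            # hypothetical prior probability of H1

sequential = z
for y in observations:
    sequential = update(sequential, y)   # posterior becomes the next prior

batch = batch_posterior(z, observations)
print(sequential, batch)
```

The two numbers agree to rounding error, which is why the deferred-decision procedure needs to store only the current probability of $H_1$.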


¹The mathematical formulation of the deferred decision problem is given in
Section 10.2 of Blackwell and Girshick [1]. The following is quoted from that
section:

It is to be observed that, while the averaging process is laborious from a
computational point of view, the fact that the determination of the stopping regions
and the Bayes risk involves nothing more complicated than taking expectations
is of theoretical interest. Also this method can be considered an iterative
procedure for obtaining the Bayes risk and the stopping regions of the
nontruncated sequential procedure.

GENERAL THEORY


To start the process, suppose no observations may be taken. That is,
a decision must be made at once. If $T(z)$ is the expected loss for a
terminal decision, then the expected loss $E_0(z_0)$ is

$$E_0(z_0) = T(z_0) = \begin{cases} z_0\,\omega_{12}, & 0 \le z_0 \le \gamma_0 \\ (1 - z_0)\,\omega_{21}, & \gamma_0 \le z_0 \le 1 \end{cases} \tag{7}$$

where

$$\gamma_0 = \frac{\omega_{21}}{\omega_{21} + \omega_{12}}, \qquad \omega_{12},\ \omega_{21} > 0$$

The above follows from the fact that the probability of $H_1$ being true is
$z_0$ and the cost of taking action $A_2$ is $\omega_{12}$, so that the probability of a
loss when action $A_2$ is taken is $z_0$ and the expected loss is $z_0\,\omega_{12}$.
Similarly, the probability of $H_2$ being true is $(1 - z_0)$, the cost of taking
action $A_1$ is in this case $\omega_{21}$, and the expected loss is $(1 - z_0)\,\omega_{21}$.
A plot of these two expressions as a function of $z_0$ is shown in Figure 3.
Actions should be taken as indicated therein.


Figure 3. Expected losses making up $E_0(z)$, the terminal decision.

Suppose now that the option is offered of one further observation. The
expected loss curve for each value of $z_1$ in this case must be calculated.
For fixed $z_1$, the probability that the value actually observed is $y$ is
given by

$$p(y) = z_1\,p_1(y) + (1 - z_1)\,p_2(y) \tag{8}$$

If $y$ is observed, the probability that $H_1$ is true can be calculated from

$$P_y(H_1) = z_1\,\frac{p_1(y)}{p(y)} \tag{9}$$

The observer will then be in a state such that no more observations are
permitted and the probability of $H_1$ is $P_y(H_1)$. The expected loss for any
given observation $y$ is $E_0[P_y(H_1)]$. For all possible $y$, the expected loss
is $\sum_{\text{all } y} p(y)\,E_0[P_y(H_1)]$. Thus the expected cost of deferring a decision
is

$$G_1(z) = \sum_{\text{all } y} p(y)\,E_0[P_y(H_1)] + 1 \tag{10}$$

Conjoining the expected values of not taking an observation at all,
$T(z)$, and taking the one observation, $G_1(z)$, the curves appear as in
Figure 4. Alternatives should be chosen as shown therein. The optimized
expected loss curve is

$$E_1(z) = \min[T(z),\ G_1(z)] \tag{11}$$


Figure 4. $E_1(z)$, $G_1(z)$, $E_0(z)$ as functions of $z$. (Regions marked along the
$z$ axis, left to right: take action $A_2$, defer, take action $A_1$.)

Suppose now two observations may be taken and the observer must decide
whether to stop and decide now, or take the first of the two observations.
Again, the expected value of each move must be calculated. In
case a decision is made at once, $T(z)$ holds. In the case of making another
observation, suppose $y$ is observed. The probability of $y$ occurring for a
given $z_2$ is, as before,

$$p(y) = z_2\,p_1(y) + (1 - z_2)\,p_2(y) \tag{12}$$

The probability of $H_1$ being true is given by

$$P_y(H_1) = \frac{z_2\,p_1(y)}{p(y)} \tag{13}$$

At this point, the observer is in the position of having the probability of
$H_1$ being true equal to $P_y(H_1)$, and he has an option of taking one more
observation for which the expected loss is $E_1(z)$. For this observation $y$,
the expected loss is $E_1[P_y(H_1)]$. Averaged over all $y$ the expected loss
is $\sum_{\text{all } y} p(y)\,E_1[P_y(H_1)]$. When we add the cost of the observation, the
expected loss for taking an observation if two more are permitted is

$$G_2(z) = \sum_{\text{all } y} p(y)\,E_1[P_y(H_1)] + 1 \tag{14}$$

To compare the expected losses for taking and not taking an observation,
Figure 5 shows the plot of $G_2(z)$, $E_2(z)$, $E_1(z)$, and $T(z)$. The optimized
expected loss curve for the a priori probability $z$ and the option of taking
as many as two more observations is given by

$$E_2(z) = \min[T(z),\ G_2(z)] \tag{15}$$

Figure 5. $G_2(z)$, $E_2(z)$, $E_1(z)$, and $T(z)$. (Regions marked along the $z$ axis,
left to right: take action $A_2$, defer, take action $A_1$.)

The process generalizes to the following set of steps:

1. $p(y) = z_n\,p_1(y) + (1 - z_n)\,p_2(y)$; probability that the value $y$ will be
observed with the option of taking up to $n$ more observations.

2. $P_y(H_1) = z_n\,p_1(y)/p(y)$; probability of $H_1$ being true, given the
observation $y$ and the a priori probability $z_n$.

3. $G_n(z) = \sum_{\text{all } y} p(y)\,E_{n-1}[P_y(H_1)] + 1$; expected loss for taking an
observation with $n$ possible observations remaining.

4. $E_n(z) = \min[T(z),\ G_n(z)]$; optimized expected loss with $n$ possible
observations remaining, where $T(z)$ is the expected loss of making
a terminal decision.

In the case of no observations being permitted, the set of decision points
is the same, $\gamma_0 = \delta_0$, and in general, the decision points are a pair, $\gamma_n$
and $\delta_n$, the intersections of $T(z)$ and $G_n(z)$.
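The four steps above can be sketched numerically as follows. This is not the original IBM 704 program; it is a minimal illustration under stated assumptions: unit-variance normal densities with means 0 and $d'$, the sum over all $y$ replaced by a midpoint-rule integral over a truncated range, and $E_{n-1}$ tabulated on a uniform grid of $z$ with linear interpolation. Grid sizes, the integration range, and all function names are our choices.

```python
import math

def normal_pdf(y, mean):
    """Unit-variance normal density."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def interpolate(values, z):
    """Linear interpolation of a function tabulated on a uniform grid over [0, 1]."""
    m = len(values) - 1
    x = z * m
    i = min(int(x), m - 1)
    t = x - i
    return (1.0 - t) * values[i] + t * values[i + 1]

def deferred_decision(n_max, d_prime, w12, w21, obs_cost=1.0, m=200, ny=400):
    """Tabulate T, G_1..G_n and E_0..E_n on a grid of prior probabilities z."""
    zs = [i / m for i in range(m + 1)]
    T = [min(z * w12, (1.0 - z) * w21) for z in zs]      # terminal loss
    lo, hi = -6.0, d_prime + 6.0                          # truncated y range
    dy = (hi - lo) / ny
    ys = [lo + (j + 0.5) * dy for j in range(ny)]         # midpoint rule
    p1 = [normal_pdf(y, 0.0) for y in ys]
    p2 = [normal_pdf(y, d_prime) for y in ys]
    Es, Gs = [T[:]], []                                   # E_0 = T
    for _ in range(n_max):
        prev = Es[-1]
        G = []
        for z in zs:
            total = 0.0
            for a, b in zip(p1, p2):
                p = z * a + (1.0 - z) * b                 # step 1
                post = z * a / p                          # step 2
                total += p * interpolate(prev, post)
            G.append(total * dy + obs_cost)               # step 3
        Gs.append(G)
        Es.append([min(t, g) for t, g in zip(T, G)])      # step 4
    return zs, T, Gs, Es

def decision_points(zs, T, G):
    """Grid approximation to (gamma_n, delta_n): ends of the region where G_n < T."""
    defer = [z for z, t, g in zip(zs, T, G) if g < t]
    return (defer[0], defer[-1]) if defer else None

zs, T, Gs, Es = deferred_decision(n_max=3, d_prime=1.0, w12=15.0, w21=15.0)
for n, G in enumerate(Gs, start=1):
    print(n, decision_points(zs, T, G))
```

The decision points are resolved only to the spacing of the $z$ grid; a finer grid or a root-finder on $T(z) - G_n(z)$ would sharpen them.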


AN  EXAMPLE:  THE  NORMAL  CASE 


Consider the case where the logarithm of the likelihood ratio is normally
distributed under either hypothesis; it follows that it will be
similarly distributed under the other hypothesis.

Without loss of generality we can consider the observation to be real
valued with normal density functions with unit variance, and means zero
and $d'$. Let the cost of a Type I error be $\omega_{12}$ and that of a Type II error
be $\omega_{21}$, and let the cost of deferring for one observation be 1.00. If one
is allowed to defer and take at most $n$ more observations, should one
terminate and take action $A_1$, or take action $A_2$, or defer and take an
observation? This is the decision problem at "Stage n."

n = 0: Specifically,

$$p_1(y) = \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}$$

and

$$p_2(y) = \frac{1}{\sqrt{2\pi}}\,e^{-(y - d')^2/2}$$

$T(z)$ is the expected loss for a terminal decision. Let $z_0$ be the given a
priori probability of $H_1$. We have no observations left to take, and we
must make a decision. This is the simple two-alternative probability
case; as previously discussed in this paper we say $H_2$ when $0 \le z_0 \le \gamma_0$
and $H_1$ when $\gamma_0 \le z_0 \le 1$, where $\gamma_0$ is given by $\gamma_0 = \omega_{21}/(\omega_{12} + \omega_{21})$. Thus
our expected loss for a terminal decision is

$$T(z_0) = \begin{cases} z_0\,\omega_{12}, & 0 \le z_0 \le \gamma_0 \\ (1 - z_0)\,\omega_{21}, & \gamma_0 \le z_0 \le 1 \end{cases} \tag{16}$$

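n = 1: We have the possibility of taking one more observation. We wish
to know if we should accept $H_1$ or $H_2$, or defer and take our one
observation, at which time we will then apply the results of the 0 state. We
are given $z_1$, the a priori probability of $H_1$. From Bayes' theorem we are
able to calculate the a posteriori probability $z_0$ on the condition we have
an observation $y$.

$$z_0 = P_y(H_1) = z_1\,\frac{p_1(y)}{p(y)}$$

But,

$$p(y) = z_1\,p_1(y) + (1 - z_1)\,p_2(y)$$

Therefore,

$$z_0 = P_y(H_1) = \frac{z_1\,e^{-y^2/2}}{z_1\,e^{-y^2/2} + (1 - z_1)\,e^{-(y - d')^2/2}} = \frac{z_1}{z_1 + (1 - z_1)\,e^{y d'}\,e^{-d'^2/2}} \tag{17}$$

$$G_1(z) = \int_{-\infty}^{f(\gamma)} p(y)\,(1 - z_0)\,\omega_{21}\,dy + \int_{f(\gamma)}^{+\infty} p(y)\,z_0\,\omega_{12}\,dy + 1 \tag{18}$$

Substituting the expressions for $p(y)$ and $z_0$ from the above we have,

$$G_1(z) = \omega_{21} \int_{-\infty}^{f(\gamma)} (1 - z_1)\,p_2(y)\,dy + \omega_{12} \int_{f(\gamma)}^{+\infty} z_1\,p_1(y)\,dy + 1$$

$$= \frac{\omega_{21}\,(1 - z_1)}{\sqrt{2\pi}} \int_{-\infty}^{f(\gamma) - d'} e^{-t^2/2}\,dt + \frac{\omega_{12}\,z_1}{\sqrt{2\pi}} \int_{f(\gamma)}^{+\infty} e^{-t^2/2}\,dt + 1$$

where $f(\gamma)$ = the value of $y$ for which $P_y(H_1) = \gamma$.

We now wish to find an expression for $f(\gamma)$.

$$P_y(H_1) = \frac{z_1}{z_1 + (1 - z_1)\,e^{-d'^2/2}\,e^{y d'}}$$

For $P_y(H_1) = \gamma$, what is the corresponding value for $y$? Explicitly,

$$(1 - z_1)\,e^{-d'^2/2}\,e^{y d'} = \frac{z_1\,(1 - \gamma)}{\gamma} \tag{19}$$

Therefore,

$$y = f(\gamma) = \frac{1}{d'} \ln\!\left[\frac{z_1\,(1 - \gamma)}{\gamma\,(1 - z_1)}\right] + \frac{d'}{2} \tag{20}$$

and, with this value of $f(\gamma)$,

$$G_1(z) = \frac{(1 - z_1)\,\omega_{21}}{\sqrt{2\pi}} \int_{-\infty}^{f(\gamma) - d'} e^{-t^2/2}\,dt + \frac{z_1\,\omega_{12}}{\sqrt{2\pi}} \int_{f(\gamma)}^{+\infty} e^{-t^2/2}\,dt + 1$$

To find the decision points, set $T(z) = G_1(z)$ and solve for $z$. Specifically,
for the lower decision point $z_\gamma$, solve

$$z_\gamma\,\omega_{12} = G_1(z_\gamma) \tag{21}$$

for $z_\gamma$. For the upper decision point $z_\delta$, solve

$$\omega_{21}\,(1 - z_\delta) = G_1(z_\delta) \tag{22}$$

for $z_\delta$. Equations (21) and (22) for the upper and lower decision points
cannot be solved in generality. To illustrate further, let $\omega_{12} = \omega_{21} = w$.
$G_1(z)$ then becomes

$$G_1(z) = (1 - z)\,w\,\Phi\!\left(\frac{1}{d'} \ln\frac{z}{1 - z} - \frac{d'}{2}\right) + z\,w\,\Phi\!\left(-\frac{1}{d'} \ln\frac{z}{1 - z} - \frac{d'}{2}\right) + 1 \tag{23}$$

where

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt$$

To obtain the upper and lower decision points we must solve

$$z_\gamma\,w = (1 - z_\gamma)\,w\,\Phi\!\left(\frac{1}{d'} \ln\frac{z_\gamma}{1 - z_\gamma} - \frac{d'}{2}\right) + z_\gamma\,w\,\Phi\!\left(-\frac{1}{d'} \ln\frac{z_\gamma}{1 - z_\gamma} - \frac{d'}{2}\right) + 1 \tag{24}$$

and

$$w\,(1 - z_\delta) = (1 - z_\delta)\,w\,\Phi\!\left(\frac{1}{d'} \ln\frac{z_\delta}{1 - z_\delta} - \frac{d'}{2}\right) + z_\delta\,w\,\Phi\!\left(-\frac{1}{d'} \ln\frac{z_\delta}{1 - z_\delta} - \frac{d'}{2}\right) + 1 \tag{25}$$

Of course, by symmetry, $z_\delta = 1 - z_\gamma$. This may be used, or saved for a
check. Equations (24) and (25) were solved graphically for $w = 15$ and
$d' = 1$. The lower decision point is $z_\gamma = 0.343$; the upper decision point,
$z_\delta = 0.657$. Thus, our decision criteria are as follows:

•  For $0 < z < 0.343$, choose $H_2$; the expected loss is $15z$.

•  For $0.343 < z < 0.657$, defer; the expected loss is $G_1(z)$.

•  For $0.657 < z < 1$, choose $H_1$; the expected loss is $15(1 - z)$.

n = 2: This is left for the reader.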
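The graphical solution for the one-observation case can be cross-checked with a few lines of code. The sketch below evaluates the symmetric-cost expression of Equation (23) and solves Equation (24) by bisection; the bisection method, the tolerance, and the function names are our choices and are not part of the original computation.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def g1_symmetric(z, w, d_prime):
    """Equation (23): expected loss of deferring once when w12 = w21 = w."""
    log_odds = math.log(z / (1.0 - z)) / d_prime
    return ((1.0 - z) * w * phi(log_odds - d_prime / 2.0)
            + z * w * phi(-log_odds - d_prime / 2.0)
            + 1.0)

def lower_decision_point(w, d_prime, tol=1e-10):
    """Solve z*w = G1(z), Equation (24), by bisection on (0, 1/2)."""
    def f(z):
        return z * w - g1_symmetric(z, w, d_prime)
    lo, hi = 1e-9, 0.5
    if f(hi) <= 0.0:
        return None            # deferring never pays at these costs
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid           # terminal loss still below G1: root is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

z_gamma = lower_decision_point(w=15.0, d_prime=1.0)
z_delta = 1.0 - z_gamma        # by symmetry
print(round(z_gamma, 3), round(z_delta, 3))
```

For $w = 15$ and $d' = 1$ the numerical root lies near 0.34, within the reading accuracy one would expect of the graphical values 0.343 and 0.657.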

CONVERGENCE 

To prove that the process converges, we shall show that all $E_n(z)$ are
bounded below by 0, and that for each $z$, $E_n(z)$ is a monotonic nonincreasing
function of $n$. Thus there is a limiting function $E(z)$, and corresponding
limiting decision points $\gamma$ and $\delta$. Further, the limit is non-degenerate,
i.e., $\gamma > 0$ and $\delta < 1$, and there exists a $z$ such that $E(z) \ne 0$. Both the
lower bound and monotonicity follow inductively.

Note that $T(z)$ is non-negative, since $\omega_{ij} > 0$ and $T(z) = \min[z\,\omega_{12},\ (1 - z)\,\omega_{21}]$,
both non-negative quantities. Since $E_0(z)$ is equal
to $T(z)$, $E_0(z)$ is non-negative. Now if some $E_k(z)$ is non-negative, any
average of $E_k(z)$ values is non-negative. Thus $G_{k+1}(z)$, which is an
average of $E_k$ values, plus the cost of observation (which is 1) is
bounded below by 1. Finally, $E_{k+1}(z)$ is simply the lesser of $G_{k+1}(z)$
and $T(z)$, and so it is non-negative and the induction is complete. This
establishes the lower bound for $E_n(z)$.

It also follows, since $G_n(z) \ge 1$ for all $n$ and $z$, that for very small and
very large $z$, a terminal decision is always appropriate. Specifically,

$$0 \le z < \frac{1}{\omega_{12}} \quad \text{or} \quad 1 - \frac{1}{\omega_{21}} < z \le 1 \ \Longrightarrow\ T(z) < 1 \ \Longrightarrow\ E_n(z) = T(z)$$

This establishes the nondegenerate bounds on the decision points,

$$\gamma \ge \frac{1}{\omega_{12}}, \qquad \delta \le 1 - \frac{1}{\omega_{21}}$$

and confirms the contention that at least for each $n$, $E_n(z)$ is not identically
zero, since $E_n(1/\omega_{12}) = 1$ for all $n$.

The monotone decreasing nature of $E_n(z)$ is also established by induction.
By definition, since $E_0(z) = T(z)$,

$$E_1(z) = \min[E_0(z),\ G_1(z)] \le E_0(z)$$

Suppose

$$E_{n-1}(z) \le E_{n-2}(z), \quad \text{i.e.,} \quad E_{n-1}(z) - E_{n-2}(z) \le 0$$

Then

$$G_n(z) - G_{n-1}(z) = \int p(y)\,E_{n-1}\!\left[\frac{z}{z + (1 - z)\,\ell(y)}\right] dy - \int p(y)\,E_{n-2}\!\left[\frac{z}{z + (1 - z)\,\ell(y)}\right] dy$$

$$= \int p(y)\,\bigl(E_{n-1} - E_{n-2}\bigr)\!\left[\frac{z}{z + (1 - z)\,\ell(y)}\right] dy \ \le\ 0$$

since $p(y)$ is non-negative and $E_{n-1} - E_{n-2}$ is nonpositive at every $y$.
Since $G_n(z) \le G_{n-1}(z)$, $E_n(z) \le E_{n-1}(z)$, and the induction is complete.

From the viewpoint of the subsequent computer work, we should pull
out one comment from the above proof. If at some pair of successive
stages, the maximum discrepancy of $E_n(z)$ and $E_{n-1}(z)$ is $\varepsilon$, then $G_{n+1}(z)$
will be within $\varepsilon$ of $G_n(z)$, and this will enforce a corresponding bound in
the maximum discrepancy between $E_{n+1}(z)$ and $E_n(z)$. Thus the amount
of reduction of $E_n(z)$ "gained" each iteration is monotone decreasing.
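The comment can be written out in two lines. The derivation below is added for convenience; it uses only Steps 3 and 4 of the general procedure and the fact that $p(y)$ integrates to one, with $\varepsilon$ denoting the maximum over $z$ of $|E_n(z) - E_{n-1}(z)|$.

```latex
% Step 3 applied at stages n+1 and n, then subtracted:
\bigl| G_{n+1}(z) - G_n(z) \bigr|
  = \Bigl| \int p(y)\,\bigl\{ E_n[P_y(H_1)] - E_{n-1}[P_y(H_1)] \bigr\}\,dy \Bigr|
  \le \varepsilon \int p(y)\,dy = \varepsilon .

% Step 4, using |\min(T,a) - \min(T,b)| \le |a - b|:
\bigl| E_{n+1}(z) - E_n(z) \bigr|
  = \bigl| \min[T(z), G_{n+1}(z)] - \min[T(z), G_n(z)] \bigr|
  \le \bigl| G_{n+1}(z) - G_n(z) \bigr|
  \le \varepsilon .
```

In a numerical iteration this gives a practical stopping rule: once two successive $E_n$ tables differ by at most $\varepsilon$, no later pair of successive tables can differ by more.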

COMPUTER  ANALYSIS 


The expressions of Equations (13), (14), and (15) define the process
by which we may iteratively determine the $E_n$ under either a truncated or
a nontruncated process. Since hand calculation is prohibitive, we resorted
to the use of the University of Michigan's IBM 704. The solution
for normal distributions of mean zero and $d'$ values of 2, 1, and 0.5 is given
here; however, changes required for other parameters or other distributions
or changing costs will be obvious. The binary cards in machine
language for this program can be made available. The flow diagram of
Figure 6 was used to set up the computer program. The parameter $d'$,
roughly speaking, is a measure of the discriminability content of an
observation.

Figure 6. Computer flow diagram.


CONCLUSIONS 


Professor  Goode  did  not  write  any  conclusions  before  his  unexpected 
death;  the  writer  will  mention  only  the  comments  Professor  Goode  made 
after  he  had  glanced  at  his  computer  data.  The  reader  is  as  able  as  the 
writer  to  draw  other  conclusions  from  the  data. 

Professor Goode was interested in determining how quickly the intersection
points $\gamma$ and $\delta$ of the "terminal loss curve" and the "further
observation loss curve" converge to definite limits (see Figures 7, 8,
and 9). He was interested in this because he wanted to be able to take
the costs of errors, the cost of deferring a decision, and the a priori
probability of $H_1$, and from these parameters find the "optimum points" $\alpha$ and
$\beta$ and the minimum expected number of deferrals in terms of the costs
and a priori probability of $H_1$. The "optimum points" $\alpha$ and $\beta$ are
defined as the coordinates on the power curve, or ROC.² In other words,
Professor Goode wanted to take the costs involved in deferring a decision,
compare this against the costs in making a terminal decision,
and from this find the point of operation on the ROC. This is the inverse
of the problem solved by Wald in his formulation of sequential analysis.
Wald's procedure was to first set his point on the ROC, i.e., pick
$\alpha$ and $\beta$, and then find a sequential test that minimized the expected
number of deferrals to meet his operating point on the ROC.

²The power curve [10] is a plot of Type I error versus Type II error, for all
values of $\gamma$. It has been shown that for "optimum" performance in making
decisions, one should operate on the upper boundary of the ROC.

Table I shows how quickly the $\gamma$ and $\delta$ converge to definite limits; this
is the conclusion drawn by Professor Goode from his computer data. The
$\gamma$ and $\delta$ points converge very rapidly to definite limits for the costs and
$d'$ values examined.

The remainder of this paper is Professor Goode's work.

Examination of the relationship between the "terminal loss curve" and
the "further observation loss curve" yields an interesting insight into
the motive of penalties and gains for acting in one or another nonoptimum
fashion. Figure 10 is a plot of the difference between a typical pair of
such loss curves. Between $\gamma$ and $\delta$ the curve represents the added loss
incurred when $z$ is between these limits and a decision is made without
taking advantage of the possibility of making more observations. On the
other hand, for $z < \gamma$ and $z > \delta$, the dip in the curve represents the added
loss for delaying decision by taking more observations. While the "further
observation loss curve" is represented here generally, it must, of
course, correspond to some specific allowable number of further observations.
However, for the numerical values of the costs chosen in the
computed examples (where the cost of a loss is large relative to the cost
of an observation), the greatest change occurs between the "terminal
loss curve" and the "one-further-observation-allowed loss curve." After
that, relatively smaller effects are obtained from two or more observations
allowed. In the case of costs approximately equal to the cost of observation,
the penalty for taking an observation holds over the entire range
of $z$.

TABLE 1. Decision Points for the Normal Case

Figure 10. Difference between "terminal loss curve" and "further observation
loss curve."

In summary, when a priori knowledge says that the probability of one
or the other hypothesis being correct is high, cost for delaying is incurred.
When a priori knowledge is uncertain, it pays to gather information
if the cost of observations is not great relative to the possible
losses incurred with decision.

APPLICATIONS

The developments in decision-making techniques have implications
for many practical areas; among these is the search radar. At the outset
in using a search radar, the human operator was provided with a display;
with little instruction concerning the decision to be made, he was told to
report all "targets" detected. The elements of a decision mechanism
are clearly recognizable in the process. "Target" is one hypothesis;
"noise" is the alternative. The human was not instructed concerning the
choice of a cutoff point, and widely varying procedures were, and still
are, used. These varied from waiting for many scans before decision
to requiring relatively few, or from reporting varying degrees of intensity
blips to reporting only very intense blips. In warfare the a priori
probability was occasionally introduced, to some extent, by putting the
operator on the qui vive when enemy raids were expected.

With  the  need  for  automatic  detection,  some  attention  had  to  be  given 
to  the  process  of  deciding  between  target  and  noise.  In  one  radar,  the 
pulse  repetition  frequency  was  such  that  about  ten  hits  could  be  expected 
on  a  target  in  a  single  scan.  In  the  mechanization  of  detection  in  the 
radar,  the  cutoff  point  was  set  at  four  pulse  returns  above  an  arbitrary 
threshold  level.    This  choice  was  made  on  an  intuitive  basis. 

As the art developed, some statistical technique crept in. The fact
that the set might be saturated with targets, many false, led to the
consideration of methods for implementing a change in the threshold above
which a return was called a target. This threshold was so manipulated
that the "false-alarm rate," i.e., the Type I error, $\alpha$, was kept constant.
Thus records were kept of the number of targets turning out to be false
(discovered by the fact that the next return did not materialize), and the
threshold was reset so as to keep the fraction constant.

While  this  implementation  was  not  introduced  to  improve  decision 
making,  it  began  to  use  elements  associated  with  the  anatomy  of  decision. 
More recently the $\alpha$ and $\beta$ errors have been introduced into the radar and
a sequential test employed [10], storage being provided for information
on the successive returns from a given set of pulses. To date, to this
author's knowledge, no attempt has been made to introduce the a priori
probability of a target occurring into the setting of threshold, and no
place has been provided for the choice of $\alpha$ and $\beta$ based on costs of
errors. The implication of deferred decision is that information should
be stored scan-to-scan and that the a posteriori probability of a target
should be computed. When the value of the a posteriori probability is
less than $\gamma$, the information should be discarded. When the a posteriori
probability is more than $\delta$, a target should be recorded and the
information discarded, unless a decision does not need to be made, in which
case the a posteriori probability should be recorded. Of course, which $\gamma$
and $\delta$ are to be used depends on the values of costs and a priori
probabilities. These would need to be precomputed and stored in the radar for
use in setting the thresholds.
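
The deferred-decision rule just described can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the per-scan hit probabilities, the a priori probability, and the thresholds `gamma` and `delta` are hypothetical, and each scan is reduced to a single hit or no-hit indicator.

```python
def update_posterior(p, hit, p_hit_target=0.7, p_hit_noise=0.2):
    """One Bayes update of P(target) after a scan; hit probabilities are assumed."""
    like_target = p_hit_target if hit else 1.0 - p_hit_target
    like_noise = p_hit_noise if hit else 1.0 - p_hit_noise
    return p * like_target / (p * like_target + (1.0 - p) * like_noise)


def deferred_decision(scans, prior=0.1, gamma=0.01, delta=0.95):
    """Carry the a posteriori probability scan-to-scan until it crosses a threshold."""
    p = prior
    for k, hit in enumerate(scans, start=1):
        p = update_posterior(p, hit)
        if p < gamma:
            return "discard", k, p   # below gamma: drop the stored information
        if p > delta:
            return "target", k, p    # above delta: record a target
    return "defer", len(scans), p    # no decision yet: keep the posterior


print(deferred_decision([True, True, False, True, True, True]))
```

In a real set the thresholds would be precomputed from the costs of errors and the a priori probabilities, as the text notes; here they are simply passed in as arguments.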

Bayesian Decision Theory


This  article  is  based  mainly  on  the  joint  research  of  Robert  Schlaifer 
and  myself.  At  the  Harvard  Business  School,  we  administer  a  rather 
modest  research  project  on  Applied  Decision  Processes,  trying  to  develop 
the  so-called  Bayesian  viewpoint  in  statistics.  We  attempt  to  formulate 
meaningful  and  tractable  problems,  develop  new  techniques,  compute 
charts  and  tables,  write  computer  programs,  etc.  In  short,  we  try  to 
bring  the  theory  to  a  point  where  it  can  be  applied  to  a  wide  spectrum  of 
real  decision  problems  under  uncertainty.  The  problems  we  meet,  of 
course,  fall  in  the  domain  of  managerial  economics,  but  I,  like 
L.  J.  Savage,  would  argue  that  the  viewpoint  we  espouse  is  relevant  to  a 
much  wider  domain.  I  would  like  to  give  you  a  general  progress  report 
on  our  activities. 

Schlaifer has written an introductory text [1] which, although elementary,
is exclusively Bayesian in orientation. The novice reading his book
and  taking  our  courses  based  on  it  does  not  find  out  until  the  end  that 
other  schools  of  statistical  thought  exist.  Before  Schlaifer  explains  and 
partially  rejects  such  concepts  as  tests  of  hypotheses  and  confidence 
intervals,  the  student  has  already  handled  decision  problems  in  a 
Bayesian  fashion  that  are  the  analogs  of  the  classical  formulations  of 
these  problems. 

A second book written by Schlaifer [2] deletes from his first several
topics  in  probability  and  operations  research  (inventory  models,  queues, 
simulation),  and  then  presents  the  remaining  topics  in  statistics  in  a 
somewhat  permuted  order.  An  attempt  is  made  first  to  introduce  and 
criticize  the  Neyman-Pearson  theory  and  then  to  show  how  the  Bayesian 
viewpoint  overcomes  some  of  these  difficulties  and  extends  the  applica- 
bility of  the  model.  Many  statistics  teachers  find  this  book  easier  to 
follow and I suspect it was written principally for them.

We take it as given that the object of analysis in a decision problem
under  uncertainty  is  to  identify  a  course  of  action  (which  may  or  may  not 
include  experimentation)  that  is  logically  consistent  with  the  decision 
maker's  own  preferences  for  basic  consequences  (expressed  by  numerical 
utilities)  and  with  the  weights  he  attaches  to  the  possible  states  of  the 
world  (expressed  by  numerical  weights).  Call  these  subjective  probabil- 
ities if  you  will. 

For  many  years  I  struggled  with  the  philosophical  foundations  of 
statistics,  but  now  I  am  a  full-fledged  Bayesian  convert.  I  do  not  want 
to  dwell  on  philosophical  polemics,  but  to  show  the  relevance  and  power 
of  this  new  orientation  to  statistics.  I  say  "new"  even  though  the  wave 
is  derived  from  Reverend  Bayes.  However,  part  of  the  success  or  failure 
of  this  theory  depends  on  the  measurement  inputs  to  the  model.  In  the 
Bayesian  school  of  thought  these  inputs  —  utilities  and  probabilities  — 
have  a  large  subjective  component.  Can  these  highly  subjective  inputs 
be  elicited  from  responsible  decision  makers? 

•  It  depends  on  the  decision  maker. 

• It is largely a function of the training of the decision maker and of
the prestige of his statistical consultant. To put it crudely, a $200-a-day
consultant can get a businessman to answer more relevant hypothetical
questions than a $30-a-day consultant.

I  would  like  to  call  your  attention  to  a  delightful  book  written  by  Jack 
Grayson  [3],  in  which  he  investigated  the  "utility  for  money"  curves  of 
many  oil  wildcatters  and  got  geologists  and  geophysicists  to  specify 
subjective  probabilities  of  hitting  oil  in  actual  drilling  deals.  This  work 
was  connected  with  our  project,  and  I  hope  that  more  studies  will  be 
made  investigating  the  administrative  feasibility  of  obtaining  subjective 
utilities  and  probabilities  in  other  domains  of  application. 

Let  me  turn  to  a  more  formal  account  of  work  already  accomplished. 
Schlaifer  and  I  recently  published  a  book  on  our  joint  investigations  [4], 
and  I  would  like  to  sketch  the  broad  outlines  of  this  work.  Later,  I  will 
indicate  some  of  our  current  and  proposed  future  research  and  that  of 
some  of  our  doctoral  students. 

In  rough  terms,  we  consider  a  class  of  decision-making  problems  where 
the  state  of  the  world  —  a  parameter  value  —  is  unknown  and  where  partial 
insights  into  this  unknown  can  be  obtained  from  experimental  evidence. 
Given  the  data  of  a  decision  problem,  we  investigate  whether  or  not  the 
decision  maker  should  gather  experimental  evidence;  if  so,  what  kind,  how 
much,  and,  conditional  upon  the  outcome  of  the  experimental  results, 
which  terminal  action  he  should  adopt. 

To  give  you  a  better  perspective  of  the  types  of  problems  we  are  in- 
terested in,  let  me  state  two  types  of  problems  we  purport  to  handle: 


1. A decision maker (DM), an entrepreneur, wants to know how much
of  an  item  to  produce  or  stock  in  a  one-shot  inventory  problem. 
Costs  of  overage  and  underage  are  given.  The  demand  d  for  the 
product  is  unknown  and,  on  the  basis  of  a  mixture  of  regression 
analysis  and  judgment,  a  prior  distribution  is  placed  on  d.  Should 
the  DM  act  on  the  basis  of  his  information  or  should  he  conduct  a 
sample  survey,  at  a  cost,  in  order  to  better  determine  d?  If  so, 
how  should  he  balance  cost  of  experimentation  and  information? 
After  the  experimental  evidence  is  at  hand,  how  much  of  the  item 
should  the  decision  maker  stock  on  the  basis  of  this  evidence, 
his  prior  information  as  to  the  economics  of  the  problem  and  his 
attitude  toward  risk? 

2. The DM must decide on one of r processes or treatments. Treatment
i is characterized by an unknown yield parameter $\mu_i$, and
profitability is a known function of $\mu_i$. At a cost, we can sample
from  process  i.  We  might  have  differential  information  about  the 
different  processes,  sampling  costs,  etc.  As  a  function  of  the  prior 
probability  assessments  we  make  about  the  processes  and  the 
economics  of  the  problem,  how  much  should  we  sample  from  each 
of  the  processes  before  coming  to  a  terminal  decision? 

More  formally,  we  assume  the  following  six  ingredients  are  initially  pre- 
scribed in  the  formal  model: 

1. Act Space: $A = \{a\}$, the set of available terminal acts to be
considered.

2. State or Parameter Space: $\Theta = \{\theta\}$, the set of unknown states of
the world (i.e., the domain of the unknown parameters).

3. Set of Experiments: $E = \{e\}$.

4. Space of Sample Outcomes: $Z = \{z\}$.

These  four  spaces  enter  into  the  strategic  problem  in  a  chronological 
order  depicted  by  the  (symbolic)  decision  flow  diagram  or  game-tree  of 
four moves as follows:

Figure 1. Game tree or decision flow diagram.

At move 1 the DM chooses an $e$ from $E$; at move 2 chance (C) responds
with a $z$ from $Z$; at move 3 the DM chooses a terminal act $a$ from $A$; and
at move 4 chance responds with a $\theta$ from $\Theta$. In many problems the $\theta$
chosen by chance at move 4 will never be disclosed to the DM, but this
will not interfere with the ensuing analysis. At the end of the four
moves, the play $(e, z, a, \theta)$ is evaluated according to the preferences of the
DM by means of a Von Neumann utility function $u$. This is the fifth
ingredient of the formal model. Given any three plays
$(e^{(i)}, z^{(i)}, a^{(i)}, \theta^{(i)})$, for $i = 1, 2, 3$, the $u$ can be partially
characterized as follows: Letting $u(e^{(i)}, z^{(i)}, a^{(i)}, \theta^{(i)})$ be denoted
by $u^{(i)}$ for short, then:

1. $u^{(1)} \ge u^{(2)}$ if and only if the first play is at least as desirable as
the second, and

2. If the first play is preferred to the second, which is preferred to
the third (i.e., if $u^{(1)} > u^{(2)} > u^{(3)}$), and if $\alpha$ is such that

$$u^{(2)} = \alpha\, u^{(1)} + (1 - \alpha)\, u^{(3)}$$

then and only then is the second play considered indifferent to the
mixed option of receiving the first play with probability $\alpha$ and the
third with probability $1 - \alpha$. Thus $u$ reflects not only the decision
maker's preferences for plays but his attitudes toward risk.

Now to the probability measures involved, the sixth and last ingredient
of the formal model. I assume we are given a probability measure $P_{\theta,z|e}$
over the product space $\Theta \times Z$ for each experiment $e$. I am suppressing
measure-theoretic considerations (e.g., what is the $\sigma$-algebra of
measurable sets, etc.?). You can, if you wish, think of the spaces as discrete
to avoid the torturous labyrinths of measurability and the subtleties of
measure-theoretic conditional probability.

From the measure $P_{\theta,z|e}$ we can derive four associated measures:

1. The marginal distribution on $\Theta$, denoted by $P_\theta$, which we assume
does not depend on $e$;

2. The conditional measure on $Z$ given $(\theta, e)$, denoted by $P_{z|\theta,e}$;

3. The marginal measure on $Z$, given $e$ but marginal with respect to
$\theta$, denoted by $P_{z|e}$;

4. The conditional measure on $\Theta$ given $(z, e)$, denoted by $P_{\theta|z,e}$.

To complete the tree representation we observe that at move 2, chance
chooses $z$ according to $P_{z|e}$ (marginal with respect to $\theta$) and at move 4,
chance selects $\theta$ according to $P_{\theta|z,e}$. We shall refer to $P_\theta$ as the
"prior" measure since this can be assessed prior to the selection of $e$ and
the observation $z$; we shall refer to $P_{\theta|z,e}$ as the "posterior" measure
since this measure is relevant posterior to the experiment $e$ and after $z$
has been observed.

It is typically the case in applications that $P_{\theta,z|e}$ is not given or
assigned directly. Rather, one usually assigns the prior measure $P_\theta$ and
the family of conditional sampling measures $P_{z|\theta,e}$. These, of course,
determine $P_{\theta,z|e}$ and, therefore, the marginal sampling measure $P_{z|e}$ and
the family of posterior measures $P_{\theta|z,e}$. Observe that the game-tree
(Figure 1) is defined in terms of these two latter measures. The
mathematical mechanics for converting measures $P_\theta$ and $P_{z|\theta,e}$ into
measures $P_{z|e}$ and $P_{\theta|z,e}$ involve the celebrated Bayes' Theorem. I will
expand on the treatment and interpretation of these probability measures, for
herein lies a good deal of the controversy now permeating the field of
statistics.
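
For discrete spaces the conversion just described is a few lines of arithmetic. The following Python sketch is only an illustration: the function name `bayes`, the dictionary layout, and the two-state drug example with its numbers are assumptions, not taken from the text.

```python
def bayes(prior, likelihood):
    """Convert a prior and sampling measures into marginal and posterior measures.

    prior[theta]         = P(theta)
    likelihood[theta][z] = P(z | theta, e) for one fixed experiment e
    Returns (marginal, posterior) with marginal[z] = P(z | e) and
    posterior[z][theta] = P(theta | z, e).
    """
    outcomes = {z for row in likelihood.values() for z in row}
    marginal = {
        z: sum(prior[t] * likelihood[t].get(z, 0.0) for t in prior)
        for z in outcomes
    }
    posterior = {
        z: {t: prior[t] * likelihood[t].get(z, 0.0) / marginal[z] for t in prior}
        for z in outcomes
        if marginal[z] > 0.0
    }
    return marginal, posterior


# Hypothetical numbers: a drug is either effective or not, and one patient is treated.
prior = {"effective": 0.3, "ineffective": 0.7}
likelihood = {
    "effective": {"cure": 0.8, "no cure": 0.2},
    "ineffective": {"cure": 0.4, "no cure": 0.6},
}
marginal, posterior = bayes(prior, likelihood)
print(marginal["cure"], posterior["cure"]["effective"])
```

The two returned dictionaries are exactly the measures that label the chance moves of the game-tree: `marginal` for move 2 and `posterior` for move 4.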
The conditional (sampling) measures $P_{z|\theta,e}$ are usually given by model
considerations (e.g., $z$ is the experimental outcome of $n$ Bernoulli trials,
where $\theta$ is the probability of success at each trial). The assignment of
a measure $P_\theta$ to $\Theta$ is the primary center of controversy. For example, a
new drug has a cure/success ratio $\theta$ which is unknown. Does it make
sense to

• Put a subjective probability distribution over $\theta$?

•  Work  with  this  probability  distribution  in  the  same  fashion  as  one 
would  work  with  a  so-called  frequency-based,  "objective,"  probabil- 
ity distribution? 

My  answer  is  "yes,"  and  my  convictions  are  based,  in  part,  on  four 
considerations: 

1. My own general disillusionment with the uses of classical, or
standard, "objective" conventions for coping with uncertainty (e.g.,
significance levels, confidence coefficients, etc.). These conventions
are ingenious but inadequate devices designed to avoid the
use of subjective probabilities.

2.  A  set  of  basic  underlying  principles  of  consistent  behavior  which, 
to  me,  seem  very  reasonable  indeed,  imply  that  one  should  assign  a 
weighting measure to $\Theta$.

3.  It  is  possible  to  give  operational  meaning  to  the  DM's  subjective- 
probability  distributions  in  terms  of  his  basic  choices  in  action 
situations  that  may  be  of  a  hypothetical  nature. 

4.  Once  this  more  liberal  interpretation  of  probability  is  allowed,  we 
substantially  increase  the  scope  of  applications  of  the  mathematical 
model  of  probability  to  important  decision  problems  in  such  fields 
as economics, medicine and agriculture.

Those of us who hold this viewpoint (I suppose we are still a dissident
minority) are called Bayesians because of the central role that
Bayes' Theorem plays in converting measures $P_\theta$, $P_{z|\theta,e}$ to measures
$P_{z|e}$ and $P_{\theta|z,e}$. The title is a misnomer, for Bayes' Theorem is not
very profound. Also, in some applications, it might be more appropriate
to directly assess the measures $P_{z|e}$ and $P_{\theta|z,e}$ rather than $P_\theta$ and
$P_{z|\theta,e}$. Here is one such example:

An oil wildcatter must choose between two terminal acts (the $A$ space):
drilling a well or selling his rights in a given location. The desirability
of drilling depends on the unknown states of the world, the $\Theta$ space, which
in this case is the amount of oil that will be found. Before making his
decision the wildcatter, the DM, can obtain geological and geophysical
evidence by means of seismographic recordings that, for our purposes,
are to be considered as an available experiment in the experimental
space $E$. The seismic tests do not directly give an indication of the
amount of oil present but give indications (let us suppose perfect
indications) of the subsurface structure. The possible structures comprise
the sample space $Z$. Previous experience with the amounts of oil found
with various types of geologic structures makes it possible to directly
assign in an almost objective fashion the posterior measures $P_{\theta|z,e}$.
Furthermore, it will be much more meaningful for geologists to assign a
probability measure $P_{z|e}$ over the potential subsurface structures (i.e.,
over the $Z$ space), than it would be to assign a measure $P_\theta$ to the
possible amounts of oil to be found. Thus it is more natural to assign
measures $P_{\theta|z,e}$ and $P_{z|e}$ directly, and no recourse to Bayes' Theorem
is necessary to get the measures used in defining the game-tree.

Now let us assume the decision problem is completely specified: the
spaces $E$, $Z$, $A$, $\Theta$ are defined, the utility preference function $u$ is
prescribed over the set of possible plays of the game, and the measures
$P_{z|e}$ and $P_{\theta|z,e}$ are assessed either indirectly through Bayes' Theorem
or directly.

We now turn to the analysis of the game which, at least at a conceptual
level, is rather straightforward. Perhaps I should say straight "backward"
because the technique for analysis can very suggestively be described
as an "averaging-out-and-folding-back" process. It is the usual
backward-induction procedure used in dynamic programming problems.
Here is how it works:

A full play of the form $(e', z', a', \theta')$ has the evaluation $u(e', z', a', \theta')$.
Now suppose the DM selects $e'$, chance gives $z'$, and the DM uses $a'$;
what is the evaluation of being at this juncture of the game-tree before
chance selects a $\theta$ from $\Theta$? Denoting this evaluation by $u^*(e', z', a')$ we
obtain by the averaging out process that

$$u^*(e', z', a') = E_{\theta|z',e'}\, u(e', z', a', \theta)$$

where the operator $E_{\theta|z',e'}$ expects over the random variable $\theta$ with
respect to the posterior measure $P_{\theta|z',e'}$. Working backward we next
denote the evaluation of the partial play $(e', z')$ by $u^*(e', z')$ and we have
by the folding-back process that

$$u^*(e', z') = \max_{a \in A} u^*(e', z', a)$$

Working backward once again we denote the evaluation of the experiment
$e'$ chosen from $E$ by $u^*(e')$, and have

$$u^*(e') = E_{z|e'}\, u^*(e', z)$$

where the operator $E_{z|e'}$ expects over the random variable $z$ with respect
to the marginal sampling measure $P_{z|e'}$. Finally, the value of the game
is given by

$$u^* = \max_{e' \in E} u^*(e')$$
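
When all four spaces are small and finite, the averaging-out-and-folding-back recursion can be carried out directly. The Python sketch below is an illustration only: the function name `fold_back`, the data layout, and the wildcatter-style payoffs and probabilities are hypothetical choices made for the example.

```python
def fold_back(experiments, acts, marginal, posterior, u):
    """Average out over theta, fold back over a, average out over z, fold back over e.

    marginal[e][z]         = P(z | e)
    posterior[e][z][theta] = P(theta | z, e)
    u(e, z, a, theta)      = utility of a full play
    Returns (value of the game, best experiment, best act for each outcome z).
    """
    value_of_e, plan = {}, {}
    for e in experiments:
        plan[e] = {}
        total = 0.0
        for z, p_z in marginal[e].items():
            # u*(e, z, a): expectation over theta under the posterior measure
            scores = {
                a: sum(p * u(e, z, a, th) for th, p in posterior[e][z].items())
                for a in acts
            }
            best_a = max(scores, key=scores.get)   # u*(e, z): best terminal act
            plan[e][z] = best_a
            total += p_z * scores[best_a]          # u*(e): expectation over z
        value_of_e[e] = total
    best_e = max(value_of_e, key=value_of_e.get)   # u*: best experiment
    return value_of_e[best_e], best_e, plan[best_e]


# Hypothetical drilling problem: the seismic test costs 10 units.
def u(e, z, a, theta):
    cost = 10.0 if e == "seismic" else 0.0
    payoff = {("drill", "oil"): 400.0, ("drill", "dry"): -100.0}.get((a, theta), 0.0)
    return payoff - cost


marginal = {
    "none": {"null": 1.0},
    "seismic": {"structure": 0.4, "no structure": 0.6},
}
posterior = {
    "none": {"null": {"oil": 0.25, "dry": 0.75}},
    "seismic": {
        "structure": {"oil": 0.5, "dry": 0.5},
        "no structure": {"oil": 1.0 / 12.0, "dry": 11.0 / 12.0},
    },
}
u_star, best_e, plan = fold_back(["none", "seismic"], ["drill", "sell"], marginal, posterior, u)
print(u_star, best_e, plan)
```

The marginal and posterior measures are supplied directly here, in the spirit of the wildcatter example; they could equally be produced from a prior and sampling measures by Bayes' Theorem.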

It is well to reflect for a moment on the generality of the $E$, $Z$, $A$, $\Theta$
formulation. If we keep in mind the fact that these spaces are abstract, I
assert that a good portion of mathematical statistics can be so formulated.
However,  as  most  of  us  come  to  realize,  it  is  far  easier  to  formulate  a 
problem  than  to  solve  it.  If  each  of  the  four  spaces  contains  but  a  few 
elements,  the  averaging-out-and-folding-back-process  can  be  handled  by 
simple  arithmetic.  If  the  spaces  are  finite  and  modest  in  size,  we  can 
haul  out  the  digital  computer  and  let  it  help  us  with  a  few  million  of  the 
inevitable  multiplications.  If,  as  is  most  likely  the  case,  some  of  the 
spaces  are  Euclidean,  some  structure  must  be  imposed  to  make  the  prob- 
lem analytically  tractable.  I  shall  consider  two  methods  of  simplifying 
the  problem: 

•  Structure  on  the  utility  function  u. 

• Structure on the probability measure $P_{\theta,z|e}$.

As one illustration of an assumption on $u$, let us suppose $u(e, z, a, \theta)$ is
linear in $\theta$ for each given $(e, z, a)$. Then observe that

$$u^*(e, z, a) = E_{\theta|z,e}\, u(e, z, a, \theta) = u(e, z, a, \bar{\theta}_{z,e})$$

where $\bar{\theta}_{z,e}$ is the mean of the posterior distribution of the random
variable $\theta$ given that experiment $e$ yielded outcome $z$. Now assume that

$$u^*(e, z) = \max_{a \in A} u^*(e, z, a)$$

is expressible as some function of $e$ and $\bar{\theta}_{z,e}$. After the DM has decided
to perform experiment $e$, but before the outcome $z$ has been observed, the
quantity $\bar{\theta}_{z,e}$ is unknown and can be treated as a random variable.
Therefore one is led to investigate, prior to observing the outcome of
experiment $e$, the distribution of the yet-to-be-observed mean of the posterior
distribution of the "parameter" $\theta$.

In our book Schlaifer and I investigate, systematically, prior distributions
of the means and variances of posterior distributions for many important
data-generating processes (e.g., Bernoulli, Poisson, Normal). This
bit of analysis is not necessary in order to choose an optimal $a \in A$ after
$(e, z)$ has been observed. We call this the terminal problem. Rather, we
need this machinery to decide whether or not we should experiment and,
if so, how much and what kind. We call this the prepost or preposterior
or "preposterous" analysis.

As an illustration of structuring the measures $P_{\theta,z|e}$, let us remember
the following: We not only want to go from a prior on $\theta$ via the sample
$z$ to a posterior on $\theta$, but we want to be able to make probability
statements about the nature of the posterior distribution prior to observing $z$.

An experimental outcome $z$ might be quite involved to characterize.
For example, in a Bernoulli process a $z$ may take the form of a sequence
of 0's and 1's [e.g., $z = (0, 0, 1, 0, \ldots, 0, 1)$]; or, in a Poisson process, a $z$
may take the form of a sequence of inter-arrival times
[e.g., $z = (t_1, t_2, \ldots, t_r)$].
We exploit those cases where it is possible to summarize the complex
outcome $z$ by a simple sufficient statistic $y$. By "simple" we shall
mean here that the domain $Y$ of $y$ is a Euclidean space of low
dimensionality. We shall presently use the space $Y$ for a dual purpose. We now
introduce a family $\mathcal{F}$ of prior measures over the state or parameter space
that is compatible or "conjugate" to the conditional sampling measures:
The beta family of distributions is conjugate to the Bernoulli data
generating process; the gamma family is conjugate to the Poisson data
generating process; the normal family is conjugate to the normal sampling
process. Roughly, we construct the conjugate family by simply
interchanging the roles of variables and parameters in the algebraic
expression of the sample likelihood function. The family $\mathcal{F}$ is so constructed
that it is possible to index each member $F$ of $\mathcal{F}$ by an element $y$ of $Y$, the
same $Y$ we used for the space of the sufficient statistic. Suppose in a
given instance the DM chooses a prior from $\mathcal{F}$ indexed by $y'$ and then
performs experiment $e$ and observes $z$ whose sufficient statistic is $y$. The
prior in $\mathcal{F}$ is now changed to a posterior which will still belong
to the family $\mathcal{F}$, and accordingly it can be indexed by an element of $Y$.
Let the index of this posterior distribution be denoted by $y''$. Then $y''$
depends on $y'$ and on $y$. Symbolically we can write $y'' = y' * y$, implicitly
defining the binary operator $*$. In the examples cited above, the star
operator is particularly simple. However, it is not only necessary to be
able to go from $y'$ via $y$ to $y''$ but for preposterior purposes (i.e., to be
able to judiciously choose an $e$ from $E$) we want to be able to find the
distributional properties of the as-yet-to-be-determined $y''$ (i.e., after $e$
has been selected, but $z$, and therefore $y$, has not been determined).
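
For the Bernoulli process both steps can be made concrete. The Python sketch below is an illustration under assumptions: the index is taken to be $y = (r, n)$, successes and trials, with the beta prior indexed by $y' = (r', n')$ having mean $r'/n'$; the function names and the numbers are hypothetical. The star operator is then componentwise addition, and the preposterior distribution of the posterior mean follows from the beta-binomial predictive distribution of $r$.

```python
from math import comb, exp, lgamma


def star(y_prior, y_sample):
    """y'' = y' * y for the beta-Bernoulli pair: add successes and add trials."""
    (r1, n1), (r, n) = y_prior, y_sample
    return (r1 + r, n1 + n)


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def predictive(r, n, y_prior):
    """Beta-binomial probability of r successes in n trials, before z is observed."""
    r1, n1 = y_prior
    a, b = r1, n1 - r1
    return comb(n, r) * exp(log_beta(a + r, b + n - r) - log_beta(a, b))


def posterior_mean_distribution(n, y_prior):
    """Preposterior distribution of the posterior mean r''/n'' for a sample of size n."""
    dist = []
    for r in range(n + 1):
        r2, n2 = star(y_prior, (r, n))
        dist.append((r2 / n2, predictive(r, n, y_prior)))
    return dist


y_prior = (2, 5)                        # prior mean 2/5 = 0.4
dist = posterior_mean_distribution(10, y_prior)
print(sum(p for _, p in dist))          # total probability
print(sum(m * p for m, p in dist))      # expected posterior mean
```

A useful check on any such computation is that the expected value of the yet-to-be-observed posterior mean equals the prior mean.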

These  distribution  problems  have  been  systematically  treated  by 
Schlaifer  and  myself  in  our  book  [4].  We  confined  ourselves  to  so-called 
simple  random  sampling  where  the  size  of  the  experiment  is  predeter- 
mined. After  setting  up  the  general  problem,  we  carried  out  detailed 
examinations  of  problems  with  linear  and  quadratic  utility  structures,  and 
where  the  probability  measures  have  conjugate  structures. 

A  key  problem  analyzed  rather  completely  in  our  book  is  the  case  of  a 
two-act  problem  with  terminal  linear  payoffs,  normal  sampling,  normal 
prior  distributions  and  cost  of  experimentation  proportional  to  sample 
size  [5].  We  assumed  that  the  sampling  precision  or  variance  was  known 
and  for  this  case  we  included  charts  of  optimal  sample  size  and  the  net 
gain  of  optimal  sampling.  Recently,  Dr.  Arthur  Schleifer  extended  our  re- 
sults to  the  case  of  unknown  precision.  To  us  Bayesians  this  means  that 
the  unknown  precision  parameter  is  given  a  prior  distribution  along  with 
the other unknown parameter, the mean of the population. Schleifer, in
collaboration with Mr. Jerry Bracken, has published these results, supported
by extensive tables of the Student-t distribution [6].
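
The known-precision case lends itself to a short numerical sketch. The Python code below is an illustration under stated assumptions, not a reproduction of the charts in [5]: the payoff difference between the two acts is taken as $b(\mu - \mu_b)$, the prior on $\mu$ is normal with mean `m_prior` and variance `v_prior`, each observation has known variance `s2`, and sampling costs `cost_per_obs` per observation. All names and numbers are hypothetical. It uses the standard result that the expected value of sample information equals $|b|\,\sigma^*\,L(D)$, where $\sigma^{*2}$ is the reduction in variance from prior to posterior, $D = |m' - \mu_b| / \sigma^*$, and $L$ is the unit normal loss integral.

```python
from math import erf, exp, pi, sqrt


def unit_normal_loss(d):
    """L(d) = pdf(d) - d * P(Z > d) for a standard normal Z."""
    pdf = exp(-0.5 * d * d) / sqrt(2.0 * pi)
    upper_tail = 0.5 * (1.0 - erf(d / sqrt(2.0)))
    return pdf - d * upper_tail


def evsi(n, b, mu_b, m_prior, v_prior, s2):
    """Expected value of sample information for a sample of size n."""
    if n == 0:
        return 0.0
    v_post = 1.0 / (1.0 / v_prior + n / s2)   # posterior variance of the mean
    sd = sqrt(v_prior - v_post)               # std. dev. of the posterior mean, seen in advance
    return abs(b) * sd * unit_normal_loss(abs(m_prior - mu_b) / sd)


def best_sample_size(b, mu_b, m_prior, v_prior, s2, cost_per_obs, n_max=1000):
    """Search n = 0..n_max for the largest net gain of sampling."""
    def net(n):
        return evsi(n, b, mu_b, m_prior, v_prior, s2) - cost_per_obs * n
    n_opt = max(range(n_max + 1), key=net)
    return n_opt, net(n_opt)


n_opt, gain = best_sample_size(b=10.0, mu_b=0.0, m_prior=1.0, v_prior=4.0,
                               s2=25.0, cost_per_obs=0.05)
print(n_opt, gain)
```

A brute-force search over `n` is used for clarity; the charts mentioned in the text serve the same purpose without computation.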

Earlier  I  cited  the  work  of  Grayson  on  oil  wildcatting.  Motivated  by 
Grayson's  work,  Dr.  Gordon  Kaufman  in  his  doctoral  dissertation  [7] 
considered  in  greater  detail  some  theoretical  aspects  of  the  drilling 
decision  problem.  Since  oil  deposits  are  approximately  log-normally  dis- 
tributed, he  thoroughly  analyzes  Bayesian-type  decision  problems  based 
on  log-normal  distributions. 

In  another  dissertation  Dr.  Charles  Christenson  [8]  analyzes  strategic 
problems  in  syndicate  bidding  for  securities  issues.  He  thoroughly  re- 
views the  foundations  of  the  Bayesian  viewpoint  and  discusses  its  rele- 
vance and  limitations  for  group  (here  the  syndicate)  decision  problems. 

Schlaifer  and  I  confined  ourselves  to  so-called  simple  random  sampling. 
William  Ericson  recently  extended  many  of  our  results  to  the  case  of 
stratified  sampling.  His  findings  will  be  included  in  his  doctoral  disser- 
tation. 

We are currently investigating multiparameter problems (e.g., regression
and multifactor analysis-of-variance models), from the Bayesian
viewpoint, to be sure. Another direction of our proposed research is in
multistage and sequential decision models. We are investigating simple
problems of dynamic programming, in what Richard Bellman calls the
adaptive  case,  and  trying  to  evaluate  the  economics  of  getting  external 
sample  information  about  the  random  variables,  the  unknowns,  of  the 
problem.  How  much  is  it  worth,  for  example,  to  increase  the  accuracy  of 
a  forecast  of  demand  in  a  dynamic  inventory  model? 

We  suspect  that  as  the  problems  get  more  complicated  more  and  more 
measurement  parameters  will  arise  and  it  may  not  be  possible  to  give  such 
aids  as  tables  and  charts.  Our  aim  here  is  to  write  out  general  computer 
programs  so  that  the  known  parameters  of  the  problem  can  be  read  into 
the  program  and  then  the  general  program  tells  the  digital  computer  how 
to  take  over.  The  computer  output  would  give,  in  general,  a  strategy  for 
experimentation  and  subsequent  terminal  action. 

I  feel  that  our  research  program  in  the  applications  of  Bayesian  sta- 
tistics to  managerial  decision  problems  is  gathering  momentum.  How- 
ever, so  much  remains  to  be  done  and,  inevitably,  the  going  gets  tougher 
as the easy problems are skimmed off the top.


MYRON TRIBUS


The  Use  of  the  Maximum 
Entropy  Estimate  in  the 
Estimation  of  Reliability 


Recently  Professor  E.T.  Jaynes,  of  the  Washington  University  Physics 
Department,  introduced  a  formalism  for  problems  of  statistical  inference. 
This  formalism  consists  of  forming  an  initial  estimate  based  upon  the 
principle  of  maximum  entropy  and  a  method  for  refining  this  estimate  by 
the use of Bayes's theorem [1]. He showed how this formalism gives
rise to statistical mechanics [2]. Reference [3] demonstrates how Jaynes's
formalism  also  provides  a  foundation  for  classical  thermodynamics.  The 
purpose  of  this  paper  is  to  illustrate  how  that  formalism  may  be  applied 
in  the  field  of  reliability  engineering. 

The importance of Jaynes's work to reliability engineering lies in the
fact  that  one  may  easily  incorporate  one's  prior  knowledge  into  the 
estimated  reliability  if  one  uses  his  methods,  whereas  it  is  difficult  to 
do  so  by  the  usual  methods.  Many  important  problems  of  reliability  es- 
timation are  of  a  type  in  which  there  is  a  great  deal  of  information,  but 
very  little  of  it  is  available  in  the  form  of  frequency  distributions.  Most 
of  the  available  methods  for  statistical  inference  require  data  in  the  form 
of  frequencies  of  failure,  distributions  of  times  between  replacement  of 
parts,  and  similar  kinds  of  observations. 

The following example may illustrate the nature of Jaynes's contribution.
Suppose operation of a satellite for ten thousand hours is desired.
Prior to the assembly of the satellite, life tests of one sort or another
will have been run on the various parts and assemblies of parts which
make up the satellite. Finally, a satellite will be assembled and
operated, say, for a thousand hours without failure. At this point the
question will certainly arise, "What can we now infer about the reliability of
the satellite?"

*This paper was prepared with the financial support of the Missiles and Space
Vehicles Department of the General Electric Company under the supervision of
Mr. John Youtcheff.

The  statistician  who  approaches  this  problem  must  cope  with  the  fol- 
lowing dilemma: 

1.  If  he  tries  to  ignore  the  life-test  data  already  accumulated  on  the 
parts,  no  one  will  believe  his  estimate. 

2.  If  he  tries  to  incorporate  the  life-test  data,  there  is  no  agreed-upon 
procedure  he  may  employ  acceptable  to  other  statisticians.  Nor 
can  he  make  such  a  simple  and  convincing  derivation  (by  conven- 
tional methods)  that  he  can  convince  nonstatisticians  of  the  cor- 
rectness of  his  methods. 

The  key  to  understanding  this  problem  lies  in  the  precise  definition  of 
the words reliability and probability. These words have familiar meanings
as used in everyday language, but when we attempt to give them a
scientific  interpretation,  that  is  to  say,  an  interpretation  which  tells  us 
how  to  proceed  in  making  an  application  to  a  problem  of  interest,  we 
must  make  the  meanings  quite  precise.  Failure  to  establish  a  firm  founda- 
tion will  mean  that  the  conclusions  are  not  well  supported. 

The  words  probability  and  reliability  refer  to  concepts,  that  is,  to  things 
of  the  mind.  They  are  to  be  distinguished  from  percepts,  which  are  sub- 
jects of  measurements,  for  instance,  frequencies,  lifetimes,  number  of  re- 
placements, and  the  like.  Our  objective  is  to  find  a  way  of  reasoning 
from a given set of percepts (which we loosely call "data") to a prediction
about another set of percepts (or "observables") which we expect
to encounter or to learn about. Concepts are aids to logic. They are
products  of  our  imagination.  As  such  they  are  never  "right"  or  "wrong"; 
they  are  either  useful  or  useless,  rational  or  irrational,  self-consistent 
or  inconsistent  with  one  another.  The  result  of  our  mental  manipulations, 
of  course,  may  be  judged  to  be  either  "right"  or  "wrong"  according  to 
whether  or  not  it  agrees  with  an  observable. 

The  reason  it  is  important  to  avoid  using  such  words  as  right  and 
wrong  with  a  concept  is  that  we  know  from  experience  that  there  are  many 
different  ways  of  looking  at  a  subject,  and  that  what  we  find  a  profitable 
point  of  view  today  may  be  supplanted  by  a  more  powerful  view  tomorrow. 
(The caloric theory was not wrong; the caloric fluid was merely an unnecessary concept once the mechanical theory of heat was shown to be
more general.)

Once we accept the notion that the words "probability" and "reliability" do not refer to a state of things but rather are concepts which reflect
a state of knowledge, we can proceed to the important question of how
numerical values are to be assigned to represent these concepts.1

What we are led to discuss, therefore, is the rationale behind the assignment of numbers to the concepts of "probability" and "reliability."
It  should  be  self-evident  that,  in  assigning  a  number  to  the  value  of  a 
"probability,"  one  should  incorporate  all  the  available  data.  That  is, 
we  take  the  concept  of  probability  to  represent  a  state  of  knowledge 
about  something;  it  is  the  measure  of  what  we  believe  about  the  correct- 
ness of  a  statement.  We  are  inescapably  brought  to  the  necessity  of 
discussing  what  constitutes  a  rational  way  of  considering  situations  in 
which  we  do  not  know  everything  but  we  do  know  something. 

The  following  statements  will  be  taken  as  self-evident  axioms: 

1.  In  assigning  values  to  probabilities,  we  should  utilize  all  the 
available  information. 

2.  In  assigning  values  to  probabilities,  we  should  not  include  informa- 
tion we  do  not  possess;  i.e.,  we  should  not  assume  something  is 
true  unless  we  have  evidence  for  it. 

3.  The  method  of  assigning  probabilities  should  be  independent  of  the 
nature  of  the  problem,  that  is,  although  the  analytical  procedures 
may  differ  from  one  application  to  another,  they  should  all  be  de- 
monstrably derivable  from  a  common  basis. 

4.  If  truly  irrelevant  data  are  incorporated  in  a  problem,  the  answer 
obtained  should  not  depend  upon  the  data. 

5.  Anything  that  increases  the  probability  that  a  statement  is  true 
automatically  decreases  the  probability  that  it  is  false  (provided 
the  statement  is  unambiguous). 

What  is  unique  about  the  Jaynes  approach  is  that  it  describes  the  system 
of  inductive  logic  that  is  to  be  employed  before  the  details  are  developed. 
It  is  an  amazing  outcome  that  only  a  few  requirements  need  be  stated 
before  the  rules  for  the  logical  system  are  completely  defined.  If  some- 
one wishes  to  employ  a  different  system  of  logic,  he  is  at  liberty  to  do 
so.  There  is  no  one  to  say  he  is  "right"  or  "wrong."  What  can  be  said, 
however, is that his system of logic does not correspond to the requirements given and that, if his answers differ from ours, he is somehow violating the rules laid down for a system of inductive logic.

1 Those who believe that "probability" refers to a state of things, a property
of a system (as proposed by von Mises), might well consider how they would prove
to a skeptic that just because a coin comes up heads twice in a row he should
not believe that it is loaded. If he persists in this belief, will you insist he is
irrational? Stupid? If you take this point of view, you must admit we are talking
about states of knowledge. (The coin could be loaded, you know!) In the absence
of additional trials how can you proceed? Once you admit a difficulty with two
tosses, you must admit the difficulty is present with three, five, ten, or a hundred
tosses. And since no one tosses a coin an infinite number of times, the problem
is changed only in numerical value, not in essence.

The  importance  of  the  first  statement  above  should  not  be  underes- 
timated. Consider  the  following:  If  a  test  satellite  operates  1,000  hours, 
and  by  all  previous  data  it  should  have  operated  only  10  minutes,  will  not 
the  estimated  reliability  be  different  than  if  by  all  previous  data  it  should 
have  operated  successfully  one  million  hours?  The  first  statement  might 
well  be  paraphrased,  "Don't  sweep  your  prior  knowledge  under  the  rug 
just  because  it  is  difficult  to  use." 


SYMBOLISM  AND  BASIC  PROBABILITY  RELATIONS 

A  capital  letter  denotes  a  proposition: 

A  =  "proposition  A" 

A  small  letter  denotes  the  denial  of  a  proposition: 

a  =  "the  denial  of  A" 

Two  letters  together  (as  in  multiplication)  represent  two  statements 
at  once: 

AB = "both propositions A and B"

Two  letters  with  a  plus  sign  between  them  represent  "one  or  the  other 
or  both": 

A  +  B  =  "proposition  A  or  proposition  B  or  both" 

A  line  between  two  letters  means  "given  that  the  following  statement 
is  true": 

A | B = "proposition A given that proposition B is true"

Parentheses  around  a  statement  mean  that  the  statement  within  the 
parentheses  is  being  considered  as  to  its  truth  or  falseness.  The  small 
letter  p  means  "probability  of."2 

p(A|B) = "the probability that statement A is true given that statement B is true"


2 We shall not use the lower case letter p or the capital letter P to denote statements.

Jaynes has shown that if we add the following requirements to the
previous five statements3:

1.  p  should  be  represented  by  real  numbers. 

2.  The  reasoning  should  be  consistent. 

3.  p  should  be  continuous  for  continuous  variation  of  the  given  data. 

the following two equations emerge as a necessary consequence:

$$p(AB|C) = p(A|BC)\,p(B|C) \tag{1}$$

and 

$$p(A|C)^m + p(a|C)^m = 1 \tag{2}$$

The  exponent  m  appearing  in  the  above  equation  is  left  to  choice.  It  is 
entirely  a  matter  of  convention  to  take  m  =  1.  The  useful  feature  of  this 
choice is that if we set m = 1, in many problems of interest the value of
p  will  have  the  same  value  as  the  predicted  frequencies  of  an  occurrence. 
This  does  not  mean  that  the  probabilities  represent  frequencies.  We 
could,  for  example,  raise  the  first  equation  to  the  fourth  power  and  de- 
velop a  calculus  of  probabilities  in  which  the  probabilities  would  all  be 
fourth  powers  of  the  predictions  for  frequencies.  Nothing  in  the  calculus 
of  probabilities  would  be  changed  except  the  interpretation  of  the  results. 
Since  Equation  (1)  is  symmetrical  with  respect  to  statements  A  and  B 
we  may  write  it  in  the  two  ways: 

$$p(AB|C) = p(A|BC)\,p(B|C) = p(B|AC)\,p(A|C)$$

which, upon eliminating $p(AB|C)$, may be written:

$$p(A|BC) = p(A|C)\,\frac{p(B|AC)}{p(B|C)} \tag{3}$$

The  above  equation  is  known  as  Bayes'  Theorem.  The  ratio  of  proba- 
bilities on  the  right-hand  side  of  the  equation  shows  how  the  introduction 
of  information  about  B  serves  to  modify  the  prior  probabilities.    That  is: 

$p(A|BC)$ represents the probability that A is true if we know that B and C are true.

$p(A|C)$ represents the probability that A is true if we only know about C.

$p(B|AC)/p(B|C)$ represents how we change our minds when we learn about B.

3 In what follows the methods of Jaynes have been slightly altered in form but
not content. I am grateful to Professor Jaynes for permitting me to describe his
work in advance of the publication of his book.

It  is  easy  to  demonstrate  that  Equations  (1)  and  (2)  with  m  =  1,  lead 
to  the  familiar  equation: 


$$p([A + B]|C) = p(A|C) + p(B|C) - p(AB|C) \tag{4}$$
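One way the demonstration can go is sketched below. It is not taken from the text; it assumes only Equations (1) and (2) with m = 1, together with the observation that the denial of "A or B or both" is "neither A nor B," written ab in the notation above.

```latex
\begin{align*}
p([A+B]|C) &= 1 - p(ab|C)                         && \text{Eq. (2), denial of } A+B \text{ is } ab \\
           &= 1 - p(a|C)\,p(b|aC)                 && \text{Eq. (1)} \\
           &= 1 - p(a|C)\,[1 - p(B|aC)]           && \text{Eq. (2)} \\
           &= p(A|C) + p(aB|C)                    && \text{Eqs. (2) and (1)} \\
           &= p(A|C) + p(B|C)\,p(a|BC)            && \text{Eq. (1)} \\
           &= p(A|C) + p(B|C)\,[1 - p(A|BC)]      && \text{Eq. (2)} \\
           &= p(A|C) + p(B|C) - p(AB|C)           && \text{Eq. (1)}
\end{align*}
```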


THE PRINCIPLE OF MAXIMUM ENTROPY


Quite often we are given data in the form of "averages" rather than
frequency  data.  These  averages  serve  to  represent  more  extensive  data. 
For  example,  we  may  know  the  mean  lifetimes  for  different  failed  parts 
but  not  know  the  distribution  of  the  failures.  A  person  transmitting  such 
averages  transmits  an  amount  of  information  which  is  strongly  dependent 
upon  what  we  already  know.  The  problem  of  measuring  the  informational 
content  in  a  message  has  been  formulated  by  Claude  Shannon  [4]  and  the 
measure  proposed  by  him  is  given  by: 


$$S = -K \sum_i p_i \ln p_i \tag{5}$$


where S = entropy or uncertainty; K = arbitrary constant; $p_i = p(A_i|X)$;
$A_i$ = the ith possible outcome of an event; and X = the fact that the event
X occurred. It is important to recognize that S measures the information
that is left out when only a probability distribution (i.e., values for the
$p_i$) is given. The function S was shown by Shannon to be unique, that is,
to  be  the  only  function  which  satisfies  the  following  requirements: 

1. It is a function of the probabilities, $p_1, p_2, p_3, \ldots$, associated with
the outcomes $A_1, A_2, A_3, \ldots$.

2.  It  follows  an  additive  law,  that  is,  if  S(X)  is  the  uncertainty  about 
event X and if S(Y) is the uncertainty about event Y, then S(XY)
= S(X) + S(Y) (the plus sign means "addition").

3.  It  is  monotonically  increasing  with  number  of  outcomes  when  the 
$p_i$ are all equal.

4.  It  is  consistent  and  continuous. 
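The requirements above can be checked numerically. The following is a minimal sketch, not part of the original paper: the function name `entropy` and the sample probabilities are illustrative, and the additivity check assumes the two events are independent, so that the joint probabilities are simple products.

```python
import math

def entropy(probs, K=1.0):
    """Shannon's measure, Equation (5): S = -K * sum(p_i * ln p_i).

    Outcomes with zero probability contribute nothing to the sum.
    """
    return -K * sum(p * math.log(p) for p in probs if p > 0.0)

# Requirement 3: with all p_i equal, S grows with the number of outcomes
# (for n equal outcomes S = K ln n).
uniform_entropies = [entropy([1.0 / n] * n) for n in (2, 4, 8)]

# Requirement 2: for two events treated as independent (an assumption of
# this sketch), the joint probabilities are products and the uncertainties add.
px = [0.5, 0.3, 0.2]          # illustrative probabilities for event X
py = [0.9, 0.1]               # illustrative probabilities for event Y
pxy = [a * b for a in px for b in py]
additivity_gap = entropy(pxy) - (entropy(px) + entropy(py))
```

Here `additivity_gap` should vanish to rounding error, and `uniform_entropies` should increase from one entry to the next.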
JAYNES' PRINCIPLE

Jaynes'  Principle  of  Minimum  Prejudice  may  be  stated  as  follows: 

The  minimally  prejudiced  assignment  of  probabilities  is  that  which 
maximizes  the  entropy  subject  to  the  given  information. 

Reference  [3]  demonstrates  how  this  principle  leads  directly  to  the 
theorems  and  laws  of  thermodynamics.  The  point  of  view  adopted  both 
in  that  paper  and  in  this  one  is  that  we  are  dealing  with  a  theory  of 
infermation rather than information: i.e., we are asking what we may
legitimately infer from given data. In particular, subject to some choices
on  how  complex  we  wish  our  descriptions  to  be,  Jaynes'  principle  enables 
us  to  discover  the  most  rational  methods  of  description  that  do  not  vio- 
late the  requirements  put  down  in  the  first  section. 

If  the  information  upon  which  we  are  to  proceed  is  given  in  the  form 
of  averages,  the  mathematical  procedures  take  a  particularly  simple  form. 
Suppose,  for  example,  that  the  given  information  consists  of  a  set  of 
"averages"  as  indicated  below: 


$$\sum_i p_i = 1 \tag{6}$$

$$\sum_i p_i f(x_i) = \langle f \rangle \tag{7}$$

$$\sum_i p_i g(x_i) = \langle g \rangle \quad \text{etc.} \tag{8}$$


The  symbol  <  f  >  means  "expected  value  of  f"  and  refers  to  what  we 
expect  to  find  as  the  average  value  of  f.  The  symbol  x  can  be  any 
measurable quantity (e.g., diameter of ball bearings) and f(x) can be any
function  calculable  from  x  (e.g.,  surface  area  or  volume). 

The  maximization  of  Equation  (5)  subject  to  the  constraints  of  Equa- 
tions (6),  (7),  and  (8)  proceeds  easily  by  the  use  of  Lagrange's  method 
of undetermined multipliers. Letting $\alpha_0, \alpha_1, \alpha_2, \ldots$ be Lagrangian multipliers, we find:

$$p_i = e^{-\alpha_0 - \alpha_1 f(x_i) - \alpha_2 g(x_i) - \cdots} \tag{9}$$

as  the  minimally  prejudiced  assignment  of  probabilities.  The  values  of 
the  Lagrangian  multipliers  are  chosen  to  fit  the  given  data.  This  pro- 
cedure leads  to  the  following  equations: 


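$$\alpha_0 = \ln \sum_i e^{-\alpha_1 f(x_i) - \alpha_2 g(x_i) - \cdots} \tag{10}$$

$$\langle f \rangle = -\frac{\partial \alpha_0}{\partial \alpha_1} \tag{11}$$

$$\langle g \rangle = -\frac{\partial \alpha_0}{\partial \alpha_2} \tag{12}$$

$$\text{variance}(f) = \frac{\partial^2 \alpha_0}{\partial \alpha_1^2} \tag{13}$$

and

$$\text{variance}(g) = \frac{\partial^2 \alpha_0}{\partial \alpha_2^2} \tag{14}$$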

A  detailed  derivation  of  these  equations  is  given  in  the  appendix  to  this 
paper.   An  extensive  discussion  is  given  in  [5]. 
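The procedure of Equations (9) through (11) can be illustrated with a single constraint. The sketch below is not from the paper: the observable values `xs`, the target average 4.5, the choice f(x) = x, and the use of bisection to fit the multiplier are all assumptions made for illustration.

```python
import math

def maxent_probs(xs, alpha1):
    """Equation (9) with one constraint and f(x) = x:
    p_i = exp(-alpha0 - alpha1 * x_i), with exp(alpha0) the normalizing sum."""
    weights = [math.exp(-alpha1 * x) for x in xs]
    partition = sum(weights)              # exp(alpha0), Equation (10)
    return [w / partition for w in weights]

def mean(xs, probs):
    return sum(x * p for x, p in zip(xs, probs))

def alpha0(xs, alpha1):
    """Equation (10) for the single-constraint case."""
    return math.log(sum(math.exp(-alpha1 * x) for x in xs))

def solve_alpha1(xs, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Fit the multiplier by bisection; the mean falls as alpha1 rises."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(xs, maxent_probs(xs, mid)) > target_mean:
            lo = mid                      # mean too large: increase alpha1
        else:
            hi = mid
    return 0.5 * (lo + hi)

xs = [1, 2, 3, 4, 5, 6]                   # hypothetical observable values
alpha1 = solve_alpha1(xs, 4.5)            # hypothetical given average <f>
probs = maxent_probs(xs, alpha1)

# Finite-difference check of Equation (11): <f> = -d(alpha0)/d(alpha1).
h = 1e-4
d_alpha0 = (alpha0(xs, alpha1 + h) - alpha0(xs, alpha1 - h)) / (2 * h)
```

Because the requested average exceeds the unweighted average of the values, the fitted `alpha1` comes out negative, which tilts the assignment toward the larger values.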

In  considering  the  application  of  these  equations  to  problems  of  relia- 
bility engineering,  we  shall  formulate  all  our  problems  as  if  time  were 
measured  in  discrete  units  and  pass  over  to  the  continuous  variation  for 
calculations  when  necessary. 

THE  EXPONENTIAL  DISTRIBUTION 

As  remarked  in  the  preceding  section,  the  choice  of  data  to  be  recorded 
will  influence  the  kinds  of  probability  distributions  that  can  be  consid- 
ered. The  simplest  possible  description  is  one  in  which  only  average 
lifetimes  are  considered  relevant.  That  is,  we  are  considering  systems 
we  believe  can  be  satisfactorily  described  by  giving  an  average  lifetime, 
L.  Let  us  suppose  further  that  all  we  know  is  that  L  is  finite  and  the 
larger  L  is  the  more  unlikely  it  is.  With  only  this  information  how  should 
we  proceed  to  characterize  the  system? 

• Let $\Delta t$ = an interval of time sufficiently small that the accuracy of recording data is sufficient for our purposes.

• Let time be reckoned as discrete, i.e., let $t_n = n\,\Delta t$.

• Let X = the given data.

• Let $p(t_n|LX)$ = probability the system will expire at time $t_n$ given that L is the average life of the system and X represents the available knowledge.

The given data may be written in the form of two equations:

$$\sum_{n=1}^{\infty} p(t_n|LX) = 1, \quad \text{the system fails in some time}$$

and

$$\sum_{n=1}^{\infty} p(t_n|LX)\,t_n = L, \quad \text{the expected life is } L$$


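In view of these constraints and the results in the previous section, the
minimally prejudiced value for $p(t_n|LX)$ is the one which maximizes the
entropy. That is,

$$p(t_n|LX) = e^{-\alpha_0 - \alpha_1 t_n} \tag{15}$$

and

$$\alpha_0 = \ln \sum_{n=1}^{\infty} e^{-\alpha_1 t_n} \tag{16}$$

If we let $z = \exp(-\alpha_1 \Delta t)$, then we find

$$\alpha_0 = \ln \sum_{n=1}^{\infty} z^n$$

But,

$$\sum_{n=1}^{\infty} z^n = z \sum_{n=0}^{\infty} z^n = z\left(1 + \sum_{n=1}^{\infty} z^n\right)$$

and

$$\sum_{n=1}^{\infty} z^n = \frac{z}{1 - z} = \left(e^{\alpha_1 \Delta t} - 1\right)^{-1}$$

hence

$$\alpha_0 = -\ln\left(e^{\alpha_1 \Delta t} - 1\right) \tag{17}$$

From Equation (11) we find:

$$L = -\frac{\partial \alpha_0}{\partial \alpha_1} = \frac{e^{\alpha_1 \Delta t}\,\Delta t}{e^{\alpha_1 \Delta t} - 1}$$

from which we find

$$e^{-\alpha_1} = \left(1 - \frac{\Delta t}{L}\right)^{1/\Delta t} \tag{18}$$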


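For finite $\Delta t$, the above equation determines the Lagrangian multiplier
$\alpha_1$ as a function of L. With this value of $\alpha_1$ the expression for the probability distribution becomes:

$$p(t_n|LX) = \frac{\Delta t}{L - \Delta t}\left(1 - \frac{\Delta t}{L}\right)^{t_n/\Delta t} \tag{19}$$

If we let $\Delta t$ become indefinitely small (continuous time variation), we
find from Equation (18) that

$$\alpha_1 \to 1/L$$

and writing dt for $\Delta t$, we have in the limit

$$p(t_n|LX) = \frac{dt}{L}\,e^{-t_n/L} \tag{20}$$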

Equation (20) will be recognized as the exponential distribution of lifetimes associated with the Poisson process; it is referred to below simply as the Poisson distribution. Jaynes'
principle tells us if we wish to work with a single parameter distribution
and we have no other data, that we should use a Poisson distribution to
describe the system. We shall consider later the methods to be employed
in deciding the value of L appropriate to a particular system.
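It is easy to demonstrate that if by the statement $\theta_i$ we mean

$\theta_i$ = we ran a sample for time $t_i$ and it did not fail,

we shall have, consistent with Equation (20),

$$p(\theta_i|LX) = e^{-\theta_i/L} \tag{21}$$

where, in the exponent, $\theta_i$ stands for the test duration $t_i$.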
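The passage from discrete to continuous time can be checked numerically. The sketch below is illustrative only: the function names, the mean life L = 2, and the step sizes are assumptions. It solves Equation (18) for the multiplier, confirms that the resulting discrete distribution has mean L, and shows the multiplier approaching 1/L as the step shrinks.

```python
import math

def alpha1_discrete(L, dt):
    """Equation (18), exp(-alpha1) = (1 - dt/L)**(1/dt), solved for alpha1."""
    return -math.log(1.0 - dt / L) / dt

def discrete_mean_life(alpha1, dt, n_terms=20000):
    """Mean of p(t_n) proportional to exp(-alpha1 * t_n), t_n = n*dt, n >= 1.

    The infinite sum is truncated at n_terms; the neglected tail is
    negligible for the values used below.
    """
    weights = [math.exp(-alpha1 * n * dt) for n in range(1, n_terms + 1)]
    total = sum(weights)
    return sum(n * dt * w for n, w in enumerate(weights, start=1)) / total

L = 2.0                                   # hypothetical mean life, arbitrary units
results = []
for dt in (0.5, 0.1, 0.01):
    a1 = alpha1_discrete(L, dt)
    results.append((dt, a1, discrete_mean_life(a1, dt)))
```

Each entry of `results` holds the step, the fitted multiplier, and the recomputed mean; the multiplier decreases toward 1/L = 0.5 while the mean stays at L.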


THE  GAUSSIAN  DISTRIBUTION 


In some cases it is anticipated that the probability of failure may not be a monotonically decreasing function of time, but rather that a particular value of the life
is more probable than any other. Such would be the case for a rocket
engine or a battery, for example. We let $t_n$ = the life; L = the expected
life; $\sigma$ = the standard deviation; X = the other things we know. The
equations representing these facts are:

$$\sum_n p(t_n|L\sigma X) = 1$$

$$\sum_n p(t_n|L\sigma X)\,t_n = L$$

and

$$\sum_n p(t_n|L\sigma X)\,(t_n - L)^2 = \sigma^2 \tag{22}$$

This last equation may be written in simpler form if the squared expression is expanded and use is made of the previous equations,

$$\sum_n p(t_n|L\sigma X)\,t_n^2 = L^2 + \sigma^2 \tag{23}$$

The minimally prejudiced description is, by Jaynes' maximum entropy
principle,

$$p(t_n|L\sigma X) = e^{-\alpha_0 - \alpha_1 t_n - \alpha_2 t_n^2} \tag{24}$$

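The parameters $\alpha_0$, $\alpha_1$, and $\alpha_2$ are to be chosen so that they agree with
the given data, that is,

$$\alpha_0 = \ln \sum_n e^{-\alpha_1 t_n - \alpha_2 t_n^2} \tag{25}$$

$$-\frac{\partial \alpha_0}{\partial \alpha_1} = L$$

and

$$-\frac{\partial \alpha_0}{\partial \alpha_2} = L^2 + \sigma^2$$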

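It is easy to show that Equation (24) represents a truncated Gaussian
distribution. Let

$$\delta_n = t_n + \frac{\alpha_1}{2\alpha_2}$$

so that

$$\delta_n^2 = t_n^2 + \frac{\alpha_1}{\alpha_2}\,t_n + \frac{1}{4}\left(\frac{\alpha_1}{\alpha_2}\right)^2$$

and therefore

$$e^{-\alpha_0 - \alpha_1 t_n - \alpha_2 t_n^2} = e^{-\alpha_0 + \frac{1}{4}(\alpha_1^2/\alpha_2) - \alpha_2 \delta_n^2}$$

The probability distribution expressed in terms of $\delta_n$ is therefore

$$p(t_n|L\sigma X) = e^{-\alpha_0 + \frac{1}{4}(\alpha_1^2/\alpha_2)}\,e^{-\alpha_2 \delta_n^2} \tag{26}$$

Equation (26) represents a distribution which is normally distributed about
the value $\delta_n = 0$, or in terms of $t_n$ about

$$t_n = -\frac{1}{2}\left(\frac{\alpha_1}{\alpha_2}\right) \tag{27}$$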


The sign of $\alpha_2$ must be positive because we require that beyond the most
probable lifetime the probability must decrease toward zero. Therefore,
the sign of $\alpha_1$ and the ratio of $\alpha_1$ to $\alpha_2$ serve to define what part of the
normal distribution belongs to the problem at hand.

Figure 1 indicates the shape of the probability distributions which result when different values are given to the parameters $\alpha_1$ and $\alpha_2$. Curve
a represents the case in which $\alpha_1/\alpha_2$ is a large negative number. Case
b represents a moderately large value for $-\alpha_1/\alpha_2$. Case c represents a
very large positive value for $\alpha_1/\alpha_2$. Curve d represents $\alpha_1 = 0$.


Figure 1. Probability distributions $p(t_n|L\sigma X)$ for various values of $\alpha_1$ and $\alpha_2$.

If information about the variance were missing from our "given" data,
we would have the same premises as led to the Poisson distribution. For
such a case we find $\alpha_2 = 0$. The Poisson distribution therefore corresponds to a normal distribution about the point $t_n = -\infty$. (A simpler way
of saying this is to observe that the "tail" of the Gaussian distribution
is an exponential distribution.)

If  the  given  information  is  in  the  form: 


$$\sum_n p(t_n|L\sigma X)\,(t_n - \langle t \rangle)^2 = \sigma^2$$

$$\sum_n p(t_n|L\sigma X) = 1$$


it is easy to see that the appropriate distribution function is the normal
Gaussian centered about $t_n = \langle t \rangle$.

The  use  of  truncated  Gaussian  distributions  is  discussed  by  Flehinger 
and  Lewis  [6]. 

The Lagrangian multipliers are related to the expected life L and the
variance $\sigma^2$. In the relation

$$\alpha_0 = \ln \sum_n e^{-\alpha_1 t_n - \alpha_2 t_n^2} \tag{28}$$

if the time interval $\Delta t$ is taken to be uniform and small, the sum in the
previous equation may be approximated by an integral. The easiest way
to visualize this process of passing from the discrete to the continuous
case is to note that a graph of $\exp(-\alpha_1 t_n - \alpha_2 t_n^2)$ versus $t_n$ is, in the
discrete case, represented by a series of vertical lines spaced $\Delta t$ apart.
The sum in Equation (28) is the sum of the lengths of the lines. The area
under a continuous curve joining the tops of the lines is, to a good approximation, equal to $\Delta t$ times the required sum. Therefore,

$$\alpha_0 = \ln\left[\frac{1}{\Delta t}\int_0^\infty e^{-\alpha_1 t - \alpha_2 t^2}\,dt\right] \tag{29}$$


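It is relatively easy to evaluate this integral in terms of error functions
by the change of variable

$$s^2 = \frac{\alpha_1^2}{4\alpha_2} + \alpha_1 t + \alpha_2 t^2$$

to find

$$\int_0^\infty e^{-\alpha_1 t - \alpha_2 t^2}\,dt = e^{\alpha_1^2/4\alpha_2}\,\frac{\sqrt{\pi}}{2\alpha_2^{1/2}}\left[1 - \operatorname{erf}\left(\frac{\alpha_1}{2\alpha_2^{1/2}}\right)\right] \tag{30}$$

Therefore, we have, from Equation (29),

$$\alpha_0 = \frac{\alpha_1^2}{4\alpha_2} - \frac{1}{2}\ln \alpha_2 + \ln\left[1 - \operatorname{erf}\left(\frac{\alpha_1}{2\alpha_2^{1/2}}\right)\right] + \ln\frac{\sqrt{\pi}}{2\,\Delta t} \tag{31}$$

From Equation (24) we have:

$$L = -\frac{\partial \alpha_0}{\partial \alpha_1} = -\frac{\alpha_1}{2\alpha_2} + \frac{(2/\sqrt{\pi})\,e^{-\alpha_1^2/4\alpha_2}\,(2\alpha_2^{1/2})^{-1}}{1 - \operatorname{erf}(\alpha_1/2\alpha_2^{1/2})}$$

which we may write, upon multiplying by $\alpha_1$ and letting

$$x^2 = \alpha_1^2/4\alpha_2$$

and

$$Z = \frac{2x}{\sqrt{\pi}}\,\frac{e^{-x^2}}{1 - \operatorname{erf} x}$$

as

$$\alpha_1 L = f(x) = -2x^2 + Z \tag{32}$$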

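By use of Equation (13) we find:

$$\alpha_2 \sigma^2 = \alpha_2\,\frac{\partial^2 \alpha_0}{\partial \alpha_1^2} = g(x) = \frac{1}{2} + \frac{Z}{2} - \frac{Z^2}{4x^2} \tag{33}$$

We also note that

$$\frac{\sigma^2}{L^2} = \frac{g(x)}{\alpha_2}\,\frac{\alpha_1^2}{[f(x)]^2} = \frac{2x^2 + 2x^2 Z - Z^2}{4x^4 - 4x^2 Z + Z^2} = \frac{4x^2 g(x)}{[f(x)]^2} = k(x) \tag{34}$$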


By graphing the functions appearing on the right-hand side of Equations
(32), (33), and (34), we can readily find L and $\sigma$ given $\alpha_1$ and $\alpha_2$ or can
find $\alpha_1$ and $\alpha_2$ given L and $\sigma$. That is, given $\alpha_1$ and $\alpha_2$ we can easily
compute x and Z and therefore compute L and $\sigma$ from Equations
(32) and (33). Given L and $\sigma$ we can enter a graph of Equation (34). This
gives us a value for x from which, using Equation (32) and given L, we
can find $\alpha_1$. Figures 2 through 6 are plots of $\alpha_1 L$, $\alpha_2 \sigma^2$, and $\sigma^2/L^2$ as a
function of x.

Figure 2. $\alpha_1 L$ for positive x.

Figure 3. $\alpha_1 L$ for negative x.

Figure 4. $\alpha_2 \sigma^2$.

Figure 5. $\sigma^2/L^2$; high range.

Figure 6. $\sigma^2/L^2$; low range.

The use of Figures 2 through 6 is illustrated by the following example:

Given: L = 1.0, $\sigma$ = 0.1. Compute the values of $\alpha_1$, $\alpha_2$ and graph the
probability distribution.

Solution: We compute $\sigma^2/L^2 = 0.01$. From Figure 6 we find x = -7.1.
From Figure 4, $\alpha_2 \sigma^2 = 0.5$, $\alpha_2 = 50$. From Figure 3, $\alpha_1 L = -100$, $\alpha_1 = -100$.
Since, for large negative x, (1 - erf x) approaches 2,

$$e^{-\alpha_0} = e^{-(7.1)^2}\left(\frac{50}{\pi}\right)^{1/2} \Delta t$$

and therefore

$$e^{-\alpha_0 - \alpha_1 t_n - \alpha_2 t_n^2} = \left(\frac{50}{\pi}\right)^{1/2} e^{100 t_n - 50 t_n^2 - 50}\,\Delta t$$

Figure 7 shows a graph of the probability distribution.
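The graphical solution can be reproduced numerically. The sketch below is not part of the paper: the helper names and the bisection search are assumptions, and the functions f, g, and k of Equations (32) through (34) are rewritten in terms of r(x) = Z/x so that no division by x is needed. The search interval assumes the solution lies at negative x, as it does when $\sigma$ is small compared with L.

```python
import math

def r(x):
    """r(x) = (2/sqrt(pi)) * exp(-x^2) / (1 - erf x), so that Z = x * r(x)."""
    return (2.0 / math.sqrt(math.pi)) * math.exp(-x * x) / math.erfc(x)

def f(x):
    """Equation (32): alpha1 * L = -2 x^2 + Z."""
    return -2.0 * x * x + x * r(x)

def g(x):
    """Equation (33): alpha2 * sigma^2 = 1/2 + Z/2 - Z^2 / (4 x^2)."""
    return 0.5 + 0.5 * x * r(x) - 0.25 * r(x) ** 2

def k(x):
    """Equation (34): sigma^2 / L^2 = 4 x^2 g / f^2, with x^2 cancelled."""
    return 4.0 * g(x) / (r(x) - 2.0 * x) ** 2

def solve_x(ratio, lo=-20.0, hi=0.0, iters=100):
    """Bisection for k(x) = ratio on an interval of negative x (assumed)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if k(mid) < ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

L, sigma = 1.0, 0.1                # the values given in the example
x = solve_x(sigma ** 2 / L ** 2)
alpha2 = g(x) / sigma ** 2         # from Equation (33)
alpha1 = f(x) / L                  # from Equation (32)
```

The computed x is close to the value read from Figure 6, and the multipliers agree with the graphical values of -100 and 50 to within reading accuracy.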


SEQUENTIAL  LIFE  TESTING  WITHOUT  FAILURES  (Poisson) 


In the previous sections we have seen how simple qualitative statements serve to define the nature of the statistical distribution functions
which should be used to describe the life characteristics of equipment.
If we decide to describe the life characteristics by a single-parameter
distribution, and we have no other knowledge, the principle of maximum
entropy dictates that we use a Poisson distribution. The problem then
becomes one of deciding upon the parameter L appearing in the distribution function (Equation (20) or (21)). It is instructive to see how the
acquisition of data affects our state of knowledge. We begin by defining
the following: L = the expected life; $\theta_i$ = we ran a sample for time $t_i$ and
it did not fail; and X = what we knew before we ran the test.

Figure 7. Probability distribution for maximum entropy; L = 1, $\sigma$ = 0.1, $\alpha_1$ = -100, $\alpha_2$ = 50.

The problem posed is to compute $p(L|\theta_i X)$, that is, the probability we
assign to the parameter L in view of the evidence $\theta_i$ and the prior knowledge X. One of the distinguishing features of Jaynes' method is that
every quantity, even the parameter of the distribution, is treated as subject to change in the light of our knowledge. No pretense is made of
discussing a "true value" of L or the "population value" or some other
quantity as though it were possible to evaluate it "objectively."

From Bayes' Theorem,

$$p(L|\theta_i X) = p(L|X)\,\frac{p(\theta_i|LX)}{p(\theta_i|X)}$$

From Equation (21) we have

$$p(\theta_i|LX) = e^{-\theta_i/L}$$

The following is a mathematical identity, true for any parameter Y:

$$p(\theta_i|X) = \int_0^\infty p(\theta_i Y|X)\,dY$$

This equation tells us that the probability of a particular value of $\theta_i$ is
the sum over all values of Y of the joint probability of $\theta_i$ and Y. In particular, we set Y = L and have

$$p(\theta_i|X) = \int_0^\infty p(\theta_i L|X)\,dL$$

which, by use of Equation (1), may be written,

$$p(\theta_i|X) = \int_0^\infty p(L|X)\,p(\theta_i|LX)\,dL$$

Substitution of the above expression into the equation for $p(L|\theta_i X)$ gives:

$$p(L|\theta_i X) = \frac{p(L|X)\,e^{-\theta_i/L}}{\int_0^\infty p(L|X)\,e^{-\theta_i/L}\,dL} \tag{35}$$


Equation  (35)  shows  clearly  how  our  prior  knowledge  serves  to  affect  the 
probability  we  assign  to  L.  The  "broader"  or  "flatter"  the  distribution 
function  p(L|X),  the  less  effect  it  has  on  the  final  result.  p(L|X)  repre- 
sents our  prior  knowledge. 

There  is  a  temptation  to  say  at  this  point,  "Since  I  do  not  know  any- 
thing about  p(L|X),  I  shall  assume  it  is  a  constant  and  cancel  it  from  the 
equation." This cannot be done for two reasons. First of all, it violates common sense. We do know that L is not infinite. For example, a
value for L = $10^{17}$ sec (the age of the universe) is much less probable than a
value for L = $7(10^{10})$ seconds (the approximate age of Christianity).
Secondly,  we  cannot  put  p(L|X)  =  constant  because  the  denominator  in 
Equation  (35)  will  then  become  infinite. 

Figure 8 shows the function $\exp(-\theta_i/L)$ as a function of L. It is
evident that as L becomes infinite, the area under the curve is unbounded.
Therefore, it is necessary to insert for p(L|X) a distribution function
which vanishes at infinity, otherwise the integral in Equation (35) will be
infinite.

Figure 8. $p(\theta_i|LX) = e^{-\theta_i/L}$.

If we do not have a great deal of information about the expected value
of the parameter L, we must adopt that distribution function which we
have already found to be appropriate as a minimally prejudiced estimate,
i.e., a Poisson distribution. We write:

$$p(L|X) = \beta\,dL\,e^{-\beta L}$$

where $\beta$ is a single parameter which describes what we know about L
"beforehand." If $\beta$ is large, small values of L are expected. If $\beta$ is
small, L is expected to be large. Putting this distribution function into
Equation (35), we find:

$$p(L|\theta_i X) = \frac{e^{-\beta L - \theta_i/L}}{\int_0^\infty e^{-\beta L - \theta_i/L}\,dL} \tag{36}$$
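The integral may be evaluated with the help of a table of Laplace transforms, i.e., the Laplace transform of $e^{-\theta_i/L}$ with respect to the transform
variable $\beta$, letting L be the independent variable, is given by (e.g. [7]):

$$\int_0^\infty e^{-\beta L - \theta_i/L}\,dL = 2\sqrt{\theta_i/\beta}\;K_1\!\left(2\sqrt{\theta_i \beta}\right) \tag{37}$$

The expected value of L is given by:

$$\langle L \rangle = \int_0^\infty L\,p(L|\theta_i X)\,dL = \frac{\int_0^\infty L\,e^{-\beta L - \theta_i/L}\,dL}{\int_0^\infty e^{-\beta L - \theta_i/L}\,dL} \tag{38}$$

It is evident that the numerator in the above expression is the negative of
the derivative of the denominator with respect to $\beta$. Therefore, using
Equation (37) we find:

$$\langle L \rangle = -\frac{\partial}{\partial \beta} \ln\left[2\sqrt{\theta_i/\beta}\;K_1\!\left(2\sqrt{\theta_i \beta}\right)\right] \tag{39}$$

where $K_1(2\sqrt{\theta_i \beta})$ is a modified Bessel function of the second kind, of first order.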

Equation (39) represents the solution to our problem. The character of
the answer we obtain depends strongly upon our assumptions about the
parameter $\beta$. If the statement X implies that we expect a life which is
long with respect to the observed test period, $\theta_i$, we should use the expression for $K_1(2\sqrt{\theta_i \beta})$ valid for small values of the argument. On the
other hand, if the expected life is small compared to the test period, $\beta\theta_i$
will be a large number and we should use the asymptotic formula for
$K_1(2\sqrt{\theta_i \beta})$ that is valid for large values of the argument. Thus, even
though our qualitative information is rather vague, we can still put it to
use.

122 MYRON TRIBUS

For example, suppose it is found that the operating time is short compared to what the life was expected to be. That is, the test is terminated
without failure at a time which is short compared to what is expected to
be a reasonable estimate of the "life." For $\theta_i \beta$ small compared with
unity we have, since

$$K_1(z) \sim 1/z \quad (z \to 0),$$

$$\lim_{2\sqrt{\theta_i \beta} \to 0} 2\sqrt{\theta_i/\beta}\;K_1\!\left(2\sqrt{\theta_i \beta}\right) = 1/\beta$$

Therefore, from Equation (39)

$$\langle L \rangle = -\frac{\partial}{\partial \beta} \ln\frac{1}{\beta} = \frac{1}{\beta}$$

In words, "The expected value of L is unchanged by the evidence $\theta_i$ if
the time of running without failure, $t_i$, is small compared with the expected life before the test is run." On the other hand, if the equipment
fails during a test which is short compared to the expected life, a different conclusion will be reached.

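If the test is continued without failure for a time which is long compared to the expected life, the parameter $\beta\theta_i$ will be large. The asymptotic expansion for $K_1(2\sqrt{\beta\theta_i})$ may be used. Since

$$K_1(z) \sim \left(\frac{\pi}{2z}\right)^{1/2} e^{-z} \quad (z \to \infty)$$

we find, for large $\beta\theta_i$,

$$2\sqrt{\theta_i/\beta}\;K_1\!\left(2\sqrt{\beta\theta_i}\right) \approx \frac{\sqrt{\pi}}{\beta}\,(\beta\theta_i)^{1/4}\,e^{-2\sqrt{\beta\theta_i}}$$

and find for $\langle L \rangle$, from Equation (39)

$$\langle L \rangle = -\frac{\partial}{\partial \beta} \ln\left[\sqrt{\pi}\,\beta^{-3/4}\,\theta_i^{1/4}\,e^{-2\sqrt{\beta\theta_i}}\right] = \frac{3}{4\beta} + \sqrt{\frac{\theta_i}{\beta}} = \frac{1}{\beta}\left[\frac{3}{4} + \sqrt{\beta\theta_i}\right]$$

If $\sqrt{\beta\theta_i}$ is large compared to 3/4 (as it must be for the asymptotic expansion to be valid), we find:

$$\langle L \rangle = \sqrt{\theta_i/\beta} \tag{39a}$$
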
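The two limiting forms can be compared with a direct numerical evaluation of Equation (38). The sketch below is illustrative and not from the paper: the function name, the substitution L = exp(u), the integration range, and the sample values of $\beta$ and $\theta_i$ are all assumptions. It avoids Bessel functions altogether by integrating the unnormalized posterior with the trapezoid rule.

```python
import math

def posterior_mean_life(beta, theta, u_min=-50.0, u_max=50.0, n=20001):
    """Equation (38): ratio of the integrals of L*w(L) and w(L) over L,
    with w(L) = exp(-beta*L - theta/L).

    The substitution L = exp(u), dL = L du, turns both integrals into sums
    on a uniform grid in u. The integrand vanishes at both ends of the grid,
    so a plain sum equals the trapezoid rule, and the common step size
    cancels in the ratio.
    """
    step = (u_max - u_min) / (n - 1)
    num = den = 0.0
    for j in range(n):
        L = math.exp(u_min + j * step)
        w = math.exp(-beta * L - theta / L)   # unnormalized posterior
        den += w * L                          # integrand of the denominator
        num += w * L * L                      # integrand of the numerator
    return num / den

beta = 1.0                                    # hypothetical prior parameter

# Short test: beta*theta << 1, so <L> should stay near the prior value 1/beta.
short_test = posterior_mean_life(beta, 1e-4)

# Long test: beta*theta >> 1, so <L> should approach (1/beta)(3/4 + sqrt(beta*theta)).
theta_long = 400.0
long_test = posterior_mean_life(beta, theta_long)
long_asymptote = (0.75 + math.sqrt(beta * theta_long)) / beta
```

With these values `short_test` stays within about a tenth of a percent of 1/$\beta$, and `long_test` agrees closely with `long_asymptote`.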

THE  MEANING  OF  CONFIDENCE  (IN  THE  ABSENCE  OF  FAILURES) 

The  parameter  L,  which  we  have  used  in  these  calculations  is  a  mental 
construct  which  we  introduce  into  our  equations  and  then  cancel  out.  To 
see  this  role  more  clearly,  let  us  consider  the  problem  of  assigning  a 
probability  to  the  statement: 

$(\theta|\theta_i \beta X)$ = "The device will run for a time $\theta$ without failure given
that we have tested it without failure for time $\theta_i$, that $\beta$
is what we knew "beforehand" about the life characteristic, and X represents the other statements about the
problem" (The nature of X will be brought out in succeeding paragraphs).
We  may  use  the  following  equation  for  any  interpretation  we  wish  to 
make  regarding  the  parameter  L,  for  the  equation  merely  represents  the 
use  of  Equations  (1)  and  (2): 

p(0|0,j8X)  =     J      p(0L|0,j8X)  dh 

all  L 

Using  Equation  (1)  inside  the  integral  we  have: 

p(0|0,j8X)  =     J     prefix)  p(L\6fiX)  dh 

all  L 

We now consider that the statement X includes the statement, "The
failures are distributed according to a Poisson distribution with mean
value L." This statement serves to give a meaning to the parameter L and to
impose upon the interpretation of the data a probability distribution not
suggested by the evidence. It is at this point that we impose our own
conceptual framework upon the problem. Once this has been done, the
procedural methods are straightforward. As we shall see, there are many
other distributions we could consider, and the problem of choosing among
them cannot be considered independently of the nature of the data.

If L has the meaning we have given it, we may write

$$p(\theta\,|\,L\theta_i\beta X) = p(\theta\,|\,L);$$

for given L, the fact that we observed $\theta_i$ and knew about $\beta$ and X is
irrelevant. From Equation (21) we have

$$p(\theta\,|\,\theta_i\beta X) = \int e^{-\theta/L}\; p(L\,|\,\theta_i\beta X)\, dL$$

The second term in the integral is related to the problem of parameter
estimation. That is, we have introduced the parameter L, quite extraneous
to the data, and then, by the process of integration, cancelled it out of
our final estimate. We have, from Equations (36) and (37),

$$p(L\,|\,\theta_i\beta X) = \frac{e^{-\beta L - \theta_i/L}}{2\sqrt{\theta_i/\beta}\; K_1\!\left(2\sqrt{\beta\theta_i}\right)}$$

Therefore,

$$p(\theta\,|\,\theta_i\beta X) = \frac{\displaystyle\int e^{-\beta L - (\theta + \theta_i)/L}\, dL}{2\sqrt{\theta_i/\beta}\; K_1\!\left(2\sqrt{\beta\theta_i}\right)}$$


Using Equation (37) to evaluate the integral we find

$$p(\theta\,|\,\theta_i\beta X) = \left(1 + \frac{\theta}{\theta_i}\right)^{1/2} \frac{K_1\!\left(2\sqrt{\beta(\theta + \theta_i)}\right)}{K_1\!\left(2\sqrt{\beta\theta_i}\right)}$$

As before, if $\beta\theta_i$ is large, we may use the asymptotic expansion for
the Bessel function to find:

$$p(\theta\,|\,\theta_i\beta X) = \left(1 + \frac{\theta}{\theta_i}\right)^{1/4} \exp\left[2\sqrt{\beta\theta_i} - 2\sqrt{\beta(\theta + \theta_i)}\right] \tag{40}$$

Example: Suppose $\theta/\theta_i = 0.1$, $\beta\theta_i = 10$. We have:

$$\beta\theta = 1$$

$$p(\theta\,|\,\theta_i\beta X) = (1.1)^{1/4}\, e^{2(3.162278 - 3.316625)} = (1.02) \exp(-0.308694) = 1.02 \times 0.735 = 0.75$$

We interpret this example as follows. If we have a device for which we
expected a life of one year ($\beta = 1\ \text{yr}^{-1}$), the probability we
assign before the test that it will run without trouble for one year is
$e^{-1} = 0.368$. If we run it for a time $\theta_i = 10$ yr, i.e.,
$\beta\theta_i = 10$, we raise the value of our assignment of the probability
that it will run a year without failure from 0.368 to 0.75.⁴ We could have
solved the previous example by introducing the expected value of L
explicitly. By the theorem of the mean we may write

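$$p(\theta\,|\,\theta_i\beta X) = p(\theta\,|\,\langle L \rangle) \int p(L\,|\,\theta_i\beta X)\, dL$$

The integral is equal to unity. From Equation (39a) we have:

$$\beta\langle L \rangle = \sqrt{\beta\theta_i} = 3.1622$$

Therefore,

$$p(\theta\,|\,\langle L \rangle) = e^{-\beta\theta/3.1622} = e^{-1/3.1622} = 0.73,$$

which is close to the value 0.75 found from Equation (40).

In either method the value of L cancels out.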
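The worked example can be reproduced numerically. The following sketch is an illustration only: it assumes the posterior kernel $e^{-\beta L - \theta_i/L}$ of this section, evaluates the two Bessel-type integrals by a trapezoid rule in $u = \ln L$ instead of using tabulated $K_1$ values, and compares the result with the asymptotic form of Equation (40). The function names are hypothetical.

```python
import math

def bessel_type_integral(beta, theta, lo=-60.0, hi=60.0, steps=24000):
    """Trapezoid estimate of the integral over L > 0 of exp(-beta*L - theta/L),
    using the substitution L = exp(u) (Jacobian dL = L du)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        u = lo + k * h
        L = math.exp(u)
        value = math.exp(-beta * L - theta / L + u)
        total += value if 0 < k < steps else 0.5 * value
    return total * h

def survival_exact(beta, theta, theta_i):
    """p(theta | theta_i, beta, X) as a ratio of the two integrals."""
    return (bessel_type_integral(beta, theta + theta_i)
            / bessel_type_integral(beta, theta_i))

def survival_asymptotic(beta, theta, theta_i):
    """Equation (40); intended for large beta * theta_i."""
    return (1.0 + theta / theta_i) ** 0.25 * math.exp(
        2.0 * math.sqrt(beta * theta_i)
        - 2.0 * math.sqrt(beta * (theta + theta_i)))

beta, theta, theta_i = 1.0, 1.0, 10.0   # the example: beta*theta_i = 10
p_exact = survival_exact(beta, theta, theta_i)
p_asymptotic = survival_asymptotic(beta, theta, theta_i)
print(p_exact, p_asymptotic)
```

For $\beta\theta_i = 10$ the two numbers should agree to within about one percent, which is the sense in which the asymptotic expansion is adequate for the example.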

Having introduced our own conceptual framework (i.e., imposing a
Poisson distribution), we are required by considerations of intellectual
honesty to say something about how well this framework actually fits.
Another way of saying this is to give our estimate of the "probability of
the probability," that is, our "confidence" in the whole procedure. To
do this we adopt the following procedure:

1. We compute a value of L, say $L'$, such that the probability that
L exceeds $L'$ is a number, which we call the "confidence."

2. Using this number, we compute the probability that $\theta$ will be
exceeded using $p(\theta\,|\,L')$.


⁴Editors' Note: Dissatisfaction with this development is most easily made
specific at this point. It appears to us that any such numerical judgment on the
probability must depend on the original strength of the belief in the value of $\beta$.
Was the assertion that $\beta = 1$ a rough guess? Or was it from a usually reliable
source? Or (extreme case) was it from 10,000 previous experiments whose
average was $\beta = 1$? Author's rebuttal: An assignment of $\beta = 1$ was the weakest
statement you could make about the order of magnitude of the expected life and
still say something. If you had a great deal of data, you would assign a sharp
probability distribution to L. In fact, if you have a great deal of data about L
you might not use an exponential for $p(L\,|\,X)$.

We may express the second procedure in the following two equations:

$$p(L'\,|\,\theta_i\beta X) = \int_{L'}^{\infty} p(L\,|\,\theta_i\beta X)\, dL = \frac{\displaystyle\int_{L'}^{\infty} e^{-\beta L - \theta_i/L}\, dL}{2\sqrt{\theta_i/\beta}\; K_1\!\left(2\sqrt{\beta\theta_i}\right)} \tag{41}$$

which is also equal to

$$1 - \frac{\displaystyle\int_{0}^{L'} e^{-\beta L - \theta_i/L}\, dL}{2\sqrt{\theta_i/\beta}\; K_1\!\left(2\sqrt{\beta\theta_i}\right)}$$

Introducing the parameters $\omega = L/L'$, $a = \beta L'$, and $b = \beta\theta_i$ we find

$$p(L'\,|\,\theta_i\beta X) = 1 - \frac{\displaystyle\int_0^1 e^{-a\omega - b/a\omega}\, d\omega}{2\left(\sqrt{b}/a\right) K_1\!\left(2\sqrt{b}\right)} \tag{41a}$$

The integral is evaluated by numerical methods.
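One way such a numerical evaluation might be organised is sketched below. It is an assumption-laden illustration, not the author's procedure: it integrates the kernel of Equation (41) directly (rather than the scaled form (41a)), replaces the limits 0 and infinity by the hypothetical cutoffs `L_MIN` and `L_MAX`, and finds $L'$ for a chosen confidence by bisection before applying step 2 of the procedure.

```python
import math

L_MIN, L_MAX = 1e-12, 1e12   # assumed cutoffs standing in for 0 and infinity

def kernel_mass(beta, theta_i, L_low, L_high, steps=2000):
    """Unnormalised posterior mass of L between L_low and L_high for the
    kernel exp(-beta*L - theta_i/L); trapezoid rule in u = ln(L)."""
    lo, hi = math.log(L_low), math.log(L_high)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        u = lo + k * h
        L = math.exp(u)
        value = math.exp(-beta * L - theta_i / L + u)
        total += value if 0 < k < steps else 0.5 * value
    return total * h

def prob_L_exceeds(beta, theta_i, L_prime):
    """Equation (41): posterior probability that L exceeds L_prime."""
    tail = kernel_mass(beta, theta_i, L_prime, L_MAX)
    head = kernel_mass(beta, theta_i, L_MIN, L_prime)
    return tail / (tail + head)

def L_for_confidence(beta, theta_i, confidence, iterations=40):
    """Bisection in ln(L) for the L' with P(L > L') equal to `confidence`."""
    lo, hi = math.log(L_MIN), math.log(L_MAX)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if prob_L_exceeds(beta, theta_i, math.exp(mid)) > confidence:
            lo = mid    # tail probability still too large: move L' upward
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))

def probability_theta_exceeded(beta, theta_i, theta, confidence):
    """Step 2 of the procedure: exp(-theta / L') for the L' found in step 1."""
    return math.exp(-theta / L_for_confidence(beta, theta_i, confidence))

beta, theta_i, theta = 1.0, 10.0, 1.0
r_90 = probability_theta_exceeded(beta, theta_i, theta, 0.90)
r_50 = probability_theta_exceeded(beta, theta_i, theta, 0.50)
print(r_90, r_50)
```

Asking for a higher confidence pushes $L'$ downward, so the probability quoted for running a further time $\theta$ without failure becomes smaller.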
If we define the function in the above equation by

$$I = I(a, b) = I(\beta L', \beta\theta_i)$$

so that

$$p(L'\,|\,\theta_i\beta X) = 1 - I(\beta L', \beta\theta_i)$$

we may define the words confidence and reliability in terms of this
function as follows:

Let C denote the numerical value of confidence. We find that value of
$a = \beta L'$ (given $b = \beta\theta_i$) for which the function I satisfies
$I = 1 - C$. Using this value of $L'$ ($= a/\beta$), we compute the reliability
R from

$$R = p(\theta\,|\,L') = e^{-\theta/L'}$$

The smaller C is taken, the larger R (for a given $\theta$) will be. That is, for
a given desired time of operation without failure, we may quote greater
and greater reliabilities with less and less confidence.

TESTING WITH FAILURES (POISSON DISTRIBUTION)

Let us suppose that n units have been tested to failure and that they
have run for times $t_1, t_2, t_3, t_4, \dots, t_n$. That is, $t_1, t_2, t_3, \dots, t_n$
represent the times at which the first, second, third, ..., nth units
failed. Let the symbol E represent the evidence, i.e., E = "n items were
tested and they ran for times $t_1, t_2, t_3, t_4, \dots, t_n$, at which times
they failed."

Using Bayes' theorem:

$$p(L\,|\,EX) = p(L\,|\,X)\, \frac{p(E\,|\,LX)}{p(E\,|\,X)} \tag{42}$$

As before we write

$$p(E\,|\,X) = \int p(L\,|\,X)\; p(E\,|\,LX)\, dL \tag{43}$$

For $p(E\,|\,LX)$ we write, treating the tests as independent events, and using
Equation (20),

$$p(E\,|\,LX) = \frac{(dt)^n}{L^n}\, e^{-(t_1 + t_2 + t_3 + t_4 + \dots + t_n)/L}$$

If we set

$$nt' = t_1 + t_2 + t_3 + t_4 + \dots + t_n$$

we have

$$p(E\,|\,LX) = e^{-nt'/L}\, L^{-n}\, (dt)^n \tag{44}$$

Using this result, and substituting Equation (43) in Equation (42), we
have, using the minimally prejudiced value $[\beta\, dL \exp(-\beta L)]$ for
$p(L\,|\,X)$,

$$p(L\,|\,EX) = \frac{e^{-\beta L - (nt'/L)}\, L^{-n}}{\displaystyle\int e^{-\beta L - (nt'/L)}\, L^{-n}\, dL} \tag{45}$$

It is interesting to see what happens to the integral as $n \to \infty$ (for any
value of $t'$).

If a graph is made of the function $y = xe^{-x}$, it is easy to see that it
has a peak at x = 1 and falls to zero to either side. If we consider the
function $y = x^n e^{-xn}$ it is evident that this function becomes more
"peaked" around x = 1 as n becomes large. If we multiply numerator and
denominator of Equation (45) by $t'^n$, it is clear that as we permit n to
become indefinitely large the term $[e^{-nt'/L}\,(t'/L)^n]$ serves to select
the value $L = t'$. Therefore, regardless of the prior assignment of $\beta$, as n
approaches $\infty$ our rules of logic tell us to assign $L = t'$, the sample
mean. The choice of the exponential distribution to express our
prior knowledge serves to make the assignment of prior probabilities as
"flat" as possible and still be consistent with what we actually knew.


TIME SHARING

Suppose we run two units without failures. Let the first one run for
time $\theta_i$ and the other for time $\theta_j$ (i.e., the evidence consists of the
statement $\theta_i\theta_j$). We find

$$p(\theta_i\theta_j\,|\,L) = p(\theta_i\,|\,\theta_jL)\; p(\theta_j\,|\,L)$$

If the tests are truly independent of one another, and if L is the mean
life, we may write

$$p(\theta_i\theta_j\,|\,L) = p(\theta_i\,|\,L)\; p(\theta_j\,|\,L) = e^{-(\theta_i + \theta_j)/L}$$

The equation demonstrates that it is not necessary to accumulate
failure-free time on one device. The failure-free times of all devices may
be added, provided that the Poisson distribution holds.


TESTING WITH BOTH SUCCESSES AND FAILURES

Suppose we have evidence of the form

$$E = \theta_1\theta_2\theta_3 \dots \theta_m\, t_1t_2t_3 \dots t_n\, X$$

where $\theta_i$ = "the ith unit ran for time $\theta_i$ without failure," and $t_j$ =
"the jth unit failed at time $t_j$." We again use Bayes' Theorem:

$$p(L\,|\,EX) = p(L\,|\,X)\, \frac{p(E\,|\,LX)}{p(E\,|\,X)}$$

We have

$$p(L\,|\,X) = \beta e^{-\beta L}\, dL$$

$$p(E\,|\,LX) = p(\theta_1\theta_2 \dots \theta_m\,|\,LX)\; p(t_1t_2 \dots t_n\,|\,LX)$$

and

$$p(E\,|\,LX) = e^{-(\theta_1 + \theta_2 + \dots + \theta_m)/L}\; e^{-(t_1 + t_2 + \dots + t_n)/L}\; L^{-n}\, (dt)^n$$

If we let

$$m\theta' = \theta_1 + \theta_2 + \dots + \theta_m$$

and

$$nt' = t_1 + t_2 + t_3 + \dots + t_n$$

then

$$p(E\,|\,LX) = e^{-m\theta'/L}\; e^{-nt'/L}\; L^{-n}\, (dt)^n$$

We have, as before,

$$p(E\,|\,X) = \int_0^\infty p(L\,|\,X)\; p(E\,|\,LX)\, dL$$

Therefore,

$$p(L\,|\,EX) = \frac{e^{-\beta L - [(m\theta' + nt')/L]}\, L^{-n}}{\displaystyle\int_0^\infty e^{-\beta L - [(m\theta' + nt')/L]}\, L^{-n}\, dL}$$

The above equation shows that the time accumulated on all units (failed
plus unfailed) should be added together in reckoning the expected life.
However, if the number of failed units is large (i.e., large n), the addition
of the time of unfailed units will not make much difference. This result
accords with the intuitive notion that there is a great deal more
information in a message that tells about a failure at a specified time than
in one which reports a trial interrupted before failure.
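Both remarks (that the posterior settles on the sample mean $t'$ when n is large, and that unfailed time then matters little) can be illustrated numerically. The sketch below assumes the posterior kernel $e^{-\beta L - (m\theta' + nt')/L} L^{-n}$ from the equation above; the function name `posterior_mean` and all numerical inputs are hypothetical choices made for the illustration.

```python
import math

def posterior_mean(beta, n, total_time, lo=-30.0, hi=30.0, steps=60000):
    """Posterior mean of L for the kernel exp(-beta*L - total_time/L) * L**(-n),
    where total_time = m*theta' + n*t'. Summation on a uniform grid in u = ln(L)."""
    h = (hi - lo) / steps
    us = [lo + k * h for k in range(steps + 1)]
    # Log of the integrand including the Jacobian dL = L du.
    logs = [-beta * math.exp(u) - total_time * math.exp(-u) - (n - 1) * u
            for u in us]
    peak = max(logs)   # subtract the peak so exp() does not underflow everywhere
    weights = [math.exp(v - peak) for v in logs]
    return sum(w * math.exp(u) for w, u in zip(weights, us)) / sum(weights)

beta = 1.0            # prior expectation of one time unit
t_mean = 2.0          # observed mean time to failure t'
unfailed_time = 10.0  # m * theta' accumulated on units that did not fail

many_failures = posterior_mean(beta, 400, 400 * t_mean)
many_failures_plus = posterior_mean(beta, 400, 400 * t_mean + unfailed_time)
few_failures = posterior_mean(beta, 3, 3 * t_mean)
few_failures_plus = posterior_mean(beta, 3, 3 * t_mean + unfailed_time)

print(many_failures, many_failures_plus)   # both near t' = 2
print(few_failures, few_failures_plus)     # unfailed time shifts this markedly
```

With 400 failures the estimate sits near $t'$ whether or not the unfailed time is included; with only three failures the same unfailed time changes the estimate substantially.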


CALCULATING  THE  EXPECTED  LIFE  AFTER  SEEING   n  FAILURES 
AND  m  SUCCESSES  (POISSON  DISTRIBUTION) 

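Suppose that n units have been run for a total time $nt'$ so that $t'$
represents the average time to failure. Suppose also that m units have been
run without failure so that the total time accumulated on these units is
$m\theta'$. Then let $t'' = nt' + m\theta'$ = the total time accumulated on all units.
Let $\langle L_n \rangle$ be the expected life after seeing n failures. From Equation
(45) we have:

$$p(L\,|\,t''\beta X) = \frac{e^{-\beta L - t''/L}\, L^{-n}}{\displaystyle\int_0^\infty e^{-\beta L - t''/L}\, L^{-n}\, dL}$$

and

$$\langle L_n \rangle = \frac{\displaystyle\int_0^\infty e^{-\beta L - t''/L}\, L^{-n+1}\, dL}{\displaystyle\int_0^\infty e^{-\beta L - t''/L}\, L^{-n}\, dL}$$

If we let

$$I_n = \int_0^\infty e^{-\beta L - t''/L}\, L^{-n}\, dL$$

it is evident that

$$I_{n+1} = -\frac{dI_n}{dt''}$$

We introduce the variable

$$z = 2(t''\beta)^{1/2}$$

$$dz = (\beta/t'')^{1/2}\, dt'' = (2\beta/z)\, dt''$$

to find

$$\frac{d}{dt''} = \frac{2\beta}{z}\, \frac{d}{dz}$$

From Equation (37) we have:

$$I_0 = 2(t''/\beta)^{1/2}\, K_1\!\left(2\sqrt{t''\beta}\right) = (z/\beta)\, K_1(z)$$

Therefore,

$$I_1 = -\frac{2\beta}{z}\, \frac{d}{dz}\left[\frac{z}{\beta}\, K_1(z)\right] = 2K_0(z)$$

The higher order integrals are found by use of the recurrence formula
given in Jahnke-Emde [8].

$$\frac{d}{dz}\left[z^{-\nu} K_\nu(z)\right] = -z^{-\nu} K_{\nu+1}(z)$$

$$I_2 = -(2\beta/z)\, dI_1/dz = 2(2\beta/z)\, K_1(z)$$

$$I_3 = -(2\beta/z)\, dI_2/dz = 2(2\beta/z)^2\, K_2(z)$$

$$I_n = 2(2\beta/z)^{n-1}\, K_{n-1}(z)$$

In view of the equations for $\langle L_n \rangle$ and $I_n$,

$$\langle L_n \rangle = \frac{I_{n-1}}{I_n} = (t''/\beta)^{1/2}\, \frac{K_{n-2}\!\left(2\sqrt{\beta t''}\right)}{K_{n-1}\!\left(2\sqrt{\beta t''}\right)}$$

and

$$\langle L_n \rangle = \left[\frac{m\theta' + nt'}{\beta}\right]^{1/2} \frac{K_{n-2}\!\left[2\sqrt{\beta(m\theta' + nt')}\right]}{K_{n-1}\!\left[2\sqrt{\beta(m\theta' + nt')}\right]}$$

The above equation gives the maximum likelihood value for L. Confidence
tables similar to an earlier example, Equation (41a), could be computed
from these relations.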

The  preceding  formula  for  <Ln>  is  not  suited  for  extrapolation  to 
large  n.     That  is,  the  asymptotic  expressions  for  the  Bessel  function 
cannot  be  used  for  the  case  in  which  n  becomes  large  if  n  represents  both 
the  order  of  the  Bessel  function  and  the  size  of  the  argument. 
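For small n the Bessel-function ratio can be checked against a direct quadrature of the defining integrals. The sketch below is a minimal illustration under stated assumptions: it obtains $K_\nu(z)$ from the standard integral representation $K_\nu(z) = \int_0^\infty e^{-z\cosh t}\cosh(\nu t)\,dt$ rather than from tables, and the helper names and the numerical inputs are hypothetical.

```python
import math

def bessel_k(nu, z, t_max=20.0, steps=20000):
    """Modified Bessel function K_nu(z) for z > 0 from the integral
    representation above (trapezoid rule; assumes |nu| * t_max < 700)."""
    h = t_max / steps
    total = 0.0
    for k in range(steps + 1):
        t = k * h
        value = math.exp(-z * math.cosh(t)) * math.cosh(nu * t)
        total += value if 0 < k < steps else 0.5 * value
    return total * h

def kernel_integral(beta, total_time, n, lo=-40.0, hi=40.0, steps=40000):
    """I_n: integral over L > 0 of exp(-beta*L - total_time/L) * L**(-n),
    evaluated on a uniform grid in u = ln(L)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        u = lo + k * h
        total += math.exp(-beta * math.exp(u) - total_time * math.exp(-u)
                          - (n - 1) * u)
    return total * h

def expected_life_bessel(beta, total_time, n):
    """Expected life from the ratio K_{n-2}(z) / K_{n-1}(z)."""
    z = 2.0 * math.sqrt(beta * total_time)
    return (math.sqrt(total_time / beta)
            * bessel_k(n - 2, z) / bessel_k(n - 1, z))

def expected_life_quadrature(beta, total_time, n):
    """Expected life as the ratio I_{n-1} / I_n."""
    return (kernel_integral(beta, total_time, n - 1)
            / kernel_integral(beta, total_time, n))

beta, n, total_time = 1.0, 3, 10.0   # illustrative: three failures, t'' = 10
life_bessel = expected_life_bessel(beta, total_time, n)
life_quadrature = expected_life_quadrature(beta, total_time, n)
print(life_bessel, life_quadrature)
```

The direct quadrature does not rely on any asymptotic expansion, so it remains usable as a cross-check when n is not small, provided the grid in $\ln L$ is fine enough to resolve the narrowing posterior.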


SEQUENTIAL TESTING WITH FAILURES
(TRUNCATED GAUSSIAN DISTRIBUTION)

We now turn our attention to the problem of deciding upon the proper
parameters to use if a truncated Gaussian distribution appears indicated.
From Equations (24) and (31) we have:

$$p(t_i\,|\,a_1a_2X) = 2\left(\frac{a_2}{\pi}\right)^{1/2} \frac{e^{-(a_1^2/4a_2 + a_1t_i + a_2t_i^2)}}{1 - \operatorname{erf}\!\left(a_1/2a_2^{1/2}\right)}\, \Delta t \tag{46}$$

Suppose the evidence upon which we are to base our estimation of
parameters is denoted by $E_n$ and is defined by the statement:

$E_n$ = "We tested n units and they failed at times $t_1, t_2, t_3, \dots, t_n$."

We let the sample mean time and standard deviation be defined by the
two equations (where the primes refer to the sample),

$$t' = \frac{1}{n} \sum_{i=1}^{n} t_i$$

and

$$\sigma'^2 = \frac{1}{n} \sum_{i=1}^{n} (t_i - t')^2$$

From these last two equations we have:

$$\sum_{i=1}^{n} t_i = nt' \tag{47}$$

and

$$\sum_{i=1}^{n} t_i^2 = n(t'^2 + \sigma'^2) \tag{48}$$

Therefore, the probability of seeing the evidence $E_n$ in the light of
Equations (46), (47), and (48), is:

$$p(E_n\,|\,a_1a_2X) = \left[2\left(\frac{a_2}{\pi}\right)^{1/2}\right]^n \frac{e^{-n[a_1^2/4a_2 + a_1t' + a_2(t'^2 + \sigma'^2)]}}{\left[1 - \operatorname{erf}\!\left(a_1/2a_2^{1/2}\right)\right]^n}\, (\Delta t)^n \tag{49}$$

To determine the proper values for $a_1$ and $a_2$ in the light of the evidence
$E_n$ we begin with Bayes' theorem:

$$p(a_1a_2\,|\,E_nX) = p(a_1a_2\,|\,X)\, \frac{p(E_n\,|\,a_1a_2X)}{p(E_n\,|\,X)} \tag{50}$$

Introducing the identity:

$$p(E_n\,|\,X) = \int_{a_1}\!\int_{a_2} p(E_na_1a_2\,|\,X)\, da_1\, da_2$$

and using Equation (1) we have

$$p(E_n\,|\,X) = \int_{a_1}\!\int_{a_2} p(a_1a_2\,|\,X)\; p(E_n\,|\,a_1a_2X)\, da_1\, da_2$$

which, with the help of Equation (49), may be introduced into Equation
(50) to produce the relation:

$$p(a_1a_2\,|\,E_nX) = \frac{p(a_1a_2\,|\,X)\; a_2^{n/2} \left[1 - \operatorname{erf}\!\left(a_1/2a_2^{1/2}\right)\right]^{-n} e^{-n[a_1^2/4a_2 + a_1t' + a_2(t'^2 + \sigma'^2)]}}{\displaystyle\int\!\!\int p(a_1a_2\,|\,X)\; a_2^{n/2} \left[1 - \operatorname{erf}\!\left(a_1/2a_2^{1/2}\right)\right]^{-n} e^{-n[a_1^2/4a_2 + a_1t' + a_2(t'^2 + \sigma'^2)]}\, da_1\, da_2} \tag{51}$$

As n becomes very large it is evident that the terms which represent
$p(E_n\,|\,a_1a_2X)$ become more and more like a two-dimensional $\delta$-function,
that is, all other values except the peak values contribute less and less.
The function serves to pick out that pair of values of $a_1$ and $a_2$ dictated
by the evidence $E_n$ and uninfluenced by the prior probabilities as
represented by $p(a_1a_2\,|\,X)$. It is only when n is a small number that the
prior probabilities play a role.

We may find the location of the peak in the $a_1a_2$ plane by looking for
the unconditional maximum of the function:

$$\phi = \frac{1}{n} \log p(E_n\,|\,a_1a_2X) = \tfrac{1}{2} \ln a_2 - (a_1^2/4a_2) - a_1t' - a_2(t'^2 + \sigma'^2) - \ln\left[1 - \operatorname{erf}\!\left(a_1/2a_2^{1/2}\right)\right] + \ln 2 - \tfrac{1}{2} \ln \pi \tag{52}$$

We find

$$\frac{\partial\phi}{\partial a_1} = -\frac{2x^2}{a_1} - t' + \frac{2xe^{-x^2}}{\sqrt{\pi}\, a_1 (1 - \operatorname{erf} x)}$$

where, as in connection with Equations (32), (33), and (34), we have
introduced the notation $x = a_1/2a_2^{1/2}$. Further, again using the notation

$$Z = -x\, \frac{d}{dx} \ln(1 - \operatorname{erf} x) = \frac{2xe^{-x^2}}{\sqrt{\pi}\,(1 - \operatorname{erf} x)}$$

we find, upon setting $\partial\phi/\partial a_1 = 0$,

$$a_1t' = f(x) = Z - 2x^2 \tag{53}$$

We also find

$$\frac{\partial\phi}{\partial a_2} = \frac{2x^2}{a_1^2} + \frac{4x^4}{a_1^2} - \frac{2Zx^2}{a_1^2} - (t'^2 + \sigma'^2)$$

and putting $\partial\phi/\partial a_2 = 0$,

$$a_1^2(t'^2 + \sigma'^2) = 2x^2 + 4x^4 - 2Zx^2$$

or

$$a_1^2\sigma'^2 = 2x^2 + 4x^4 - 2Zx^2 - a_1^2t'^2$$

Using Equation (53) and the relation $a_2 = a_1^2/4x^2$ we have

$$a_2\sigma'^2 = \frac{2x^2 + 2Zx^2 - Z^2}{4x^2} \tag{54}$$

We find, upon comparing Equations (53) and (54) with Equations (32) and
(33), that regardless of the prior probabilities, as expressed by
$p(a_1a_2\,|\,X)$, unless these probabilities have very sharp maxima and are
zero at particular places (and it would take unusual evidence to make them
have this character in advance of the testing program), as the sample size
grows large Bayes' Theorem directs us to assign the values of $a_1$ and $a_2$ so
that we shall agree with the sample mean and variance.
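The two stationarity conditions can be checked by computing the mean and mean square of the truncated Gaussian itself, since matching the sample mean and variance is exactly what they assert. The sketch below is an illustration under assumptions: the density is taken proportional to $e^{-a_1t - a_2t^2}$ on $t \ge 0$, the moments are found by a trapezoid rule with an assumed upper cutoff, and the parameter values and function names are hypothetical.

```python
import math

def z_function(x):
    """Z = 2x exp(-x^2) / (sqrt(pi) * (1 - erf x)), with erfc(x) = 1 - erf(x)."""
    return 2.0 * x * math.exp(-x * x) / (math.sqrt(math.pi) * math.erfc(x))

def truncated_gaussian_moments(a1, a2, steps=200000):
    """Mean and mean square of the density proportional to
    exp(-a1*t - a2*t^2) on t >= 0, by the trapezoid rule."""
    # Assumed cutoff: far enough out that the integrand is negligible.
    t_max = abs(a1) / a2 + 12.0 / math.sqrt(a2)
    h = t_max / steps
    s0 = s1 = s2 = 0.0
    for k in range(steps + 1):
        t = k * h
        w = math.exp(-a1 * t - a2 * t * t)
        if k == 0 or k == steps:
            w *= 0.5
        s0 += w
        s1 += w * t
        s2 += w * t * t
    return s1 / s0, s2 / s0

a1, a2 = 0.5, 0.25                       # illustrative parameter values
x = a1 / (2.0 * math.sqrt(a2))
Z = z_function(x)
mean_t, mean_t_squared = truncated_gaussian_moments(a1, a2)

lhs_first = a1 * mean_t                  # a1 * t'
rhs_first = Z - 2.0 * x * x              # f(x) = Z - 2x^2
lhs_second = a1 * a1 * mean_t_squared    # a1^2 * (t'^2 + sigma'^2)
rhs_second = 2.0 * x**2 + 4.0 * x**4 - 2.0 * Z * x**2
print(lhs_first, rhs_first)
print(lhs_second, rhs_second)
```

Agreement of each pair indicates that the maximum of $\phi$ is reached when the distribution's own mean and variance equal those of the sample.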

SURVIVAL  PROBABILITIES  (TRUNCATED  GAUSSIAN) 

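If the probability that a system fails in time dt at t is $p(t\,|\,a_1a_2X)$, the
probability that it will survive for time $\theta$ is:

$$p(\theta\,|\,a_1a_2) = \int_\theta^\infty p(t\,|\,a_1a_2)\, dt = \frac{\displaystyle\int_\theta^\infty e^{-a_1t - a_2t^2}\, dt}{\displaystyle\int_0^\infty e^{-a_1t - a_2t^2}\, dt}$$

If we let $z = (a_1/2\sqrt{a_2}) + \sqrt{a_2}\, t$, we have

$$z^2 = (a_1^2/4a_2) + a_1t + a_2t^2$$

This result suggests that we multiply numerator and denominator by
$\exp[-(a_1^2/4a_2)]$ to find

$$p(\theta\,|\,a_1a_2) = \frac{\displaystyle\int_\theta^\infty e^{-(a_1^2/4a_2) - a_1t - a_2t^2}\, dt}{\displaystyle\int_0^\infty e^{-(a_1^2/4a_2) - a_1t - a_2t^2}\, dt} = \frac{\displaystyle\int_{\sqrt{a_2}\,\theta + (a_1/2\sqrt{a_2})}^{\infty} e^{-z^2}\, dz}{\displaystyle\int_{a_1/2\sqrt{a_2}}^{\infty} e^{-z^2}\, dz} = \frac{1 - \operatorname{erf}\!\left(\sqrt{a_2}\,\theta + a_1/2\sqrt{a_2}\right)}{1 - \operatorname{erf}\!\left(a_1/2\sqrt{a_2}\right)}$$

Letting $x = a_1/2\sqrt{a_2}$ we find

$$p(\theta\,|\,a_1a_2) = \frac{1 - \operatorname{erf}(a_1\theta/2x + x)}{1 - \operatorname{erf}(x)}$$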
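The probability of surviving beyond $\theta$ is the tail integral of the failure density from $\theta$ to infinity; its complement, $[\operatorname{erf}(a_1\theta/2x + x) - \operatorname{erf}(x)]/[1 - \operatorname{erf}(x)]$, is the probability of failure before $\theta$. The sketch below is a hedged illustration with hypothetical names and parameter values: it evaluates the survival probability from complementary error functions and compares it with a direct quadrature of the density, taken proportional to $e^{-a_1t - a_2t^2}$ on $t \ge 0$.

```python
import math

def survival(theta, a1, a2):
    """Probability of surviving beyond theta, using erfc(y) = 1 - erf(y)."""
    x = a1 / (2.0 * math.sqrt(a2))
    return math.erfc(math.sqrt(a2) * theta + x) / math.erfc(x)

def trapezoid(func, lo, hi, steps=50000):
    """Plain trapezoid rule for func on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for k in range(1, steps):
        total += func(lo + k * h)
    return total * h

def survival_by_quadrature(theta, a1, a2, t_max=40.0):
    """Ratio of tail mass to total mass; t_max is an assumed cutoff
    beyond which the integrand is negligible for these parameters."""
    kernel = lambda t: math.exp(-a1 * t - a2 * t * t)
    return trapezoid(kernel, theta, t_max) / trapezoid(kernel, 0.0, t_max)

a1, a2, theta = 0.5, 0.25, 1.0          # illustrative values (x = 0.5)
p_closed_form = survival(theta, a1, a2)
p_numeric = survival_by_quadrature(theta, a1, a2)
print(p_closed_form, p_numeric)
```

The closed form equals 1 at $\theta = 0$ and decreases toward 0 as $\theta$ grows, as a survival probability should.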


THE  INTERPRETATION  OF  THE  ENTROPY 

The entropy of a probability distribution is found from the definition

$$S = -K \sum_i p_i \ln p_i$$

For the exponential distribution

$$p_i = e^{-a_0 - a_1t_i}$$

we have

$$S_1 = S_1(L) = -K \sum_i p_i(-a_0 - a_1t_i) = Ka_0 + Ka_1\langle t \rangle = Ka_0 + Ka_1L$$

But, from Equations (17) and (18) we have, for small $\Delta t$,

$$a_1 = 1/L, \qquad a_0 = \ln(L/\Delta t)$$

Hence,

$$S_1(L) = K + K \ln L - K \ln \Delta t$$

This equation tells us that as $L/\Delta t$ becomes large, so does the entropy.
For the truncated Gaussian we find, by similar methods,

$$S_2(L, \sigma) = K\left[x^2 + \ln(1 - \operatorname{erf} x) + f(x) + g(x) + \left(f(x)/2x\right)^2 - \tfrac{1}{2} \ln a_2 + \tfrac{1}{2} \ln \pi - \ln 2 - \ln \Delta t\right]$$

We note that in this case the entropy is a function of two parameters, L and
$\sigma$. These parameters have been represented on the right-hand side of the
equation by the functions x, f(x), g(x), and $a_2$, which can be found, given
L and $\sigma$, by the use of Figures 4 through 7.

Both expressions for entropy contain the term $(-K \ln \Delta t)$, which
indicates that as $\Delta t$ is made smaller and smaller, our uncertainty
increases. In a crude way we may say that if we are content to measure
lifetimes in whole years our uncertainty about the time of failure is less
than if we measure time in microseconds. It is certainly easier to guess
what year a man will die than the microsecond at which he will expire. This
property called entropy represents a state of knowledge, not a state of
things, and it varies with whatever we shall find to be the limitation on
our knowledge.
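The dependence on the time resolution can be seen directly by summing $-\sum p_i \ln p_i$ for a discretised exponential distribution. The sketch below is an illustration only: it assumes cells $t_i = i\,\Delta t$ with $p_i \propto e^{-t_i/L}$ and K = 1, and the function name and numerical values are hypothetical.

```python
import math

def entropy_by_summation(L, dt, K=1.0):
    """Entropy -K * sum(p_i ln p_i) for p_i proportional to exp(-i*dt/L),
    i = 0, 1, 2, ... (summed until the terms are negligible)."""
    q = math.exp(-dt / L)      # ratio of successive cell probabilities
    p = 1.0 - q                # probability of the first cell
    total = 0.0
    while p > 1e-18:
        total -= p * math.log(p)
        p *= q
    return K * total

L = 1.0
coarse = entropy_by_summation(L, 0.001)
fine = entropy_by_summation(L, 0.0005)     # half the cell width
approximation = 1.0 + math.log(L / 0.001)  # K + K ln L - K ln(dt) with K = 1
print(coarse, approximation)
print(fine - coarse, math.log(2.0))
```

Halving the cell width raises the entropy by very nearly $K \ln 2$, which is the $-K \ln \Delta t$ term at work.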

The same situation arises in statistical mechanics. The entropy we
assign to a mole of a gas depends upon how many details we are able to
include in our description of its molecular and atomic structure. For
example, we are forced to adjust the entropy of a mixture of isotopes
when we learn a method for separating the isotopes.

APPENDIX: DERIVATION OF EQUATIONS (9) - (14)

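Given:

$$\sum_i p_i = 1 \tag{A1}$$

$$\sum_i p_i f_i = \langle f \rangle \tag{A2}$$

and

$$\sum_i p_i g_i = \langle g \rangle \tag{A3}$$

where $p_i = p(X_i\,|\,X)$, $f_i = f(X_i)$, and $g_i = g(X_i)$.

To maximize:

$$S = -K \sum_i p_i \ln p_i \tag{A4}$$

Differentiate Equation (A4) with respect to the $p_i$; K is constant. Set
dS = 0.

$$d\left(-\frac{S}{K}\right) = \sum_i (\ln p_i + 1)\, dp_i = 0 \tag{A5}$$

Differentiate Equations (A1), (A2), and (A3) with respect to the $p_i$,
keeping the $X_i$, $\langle f \rangle$, and $\langle g \rangle$ constant. Multiply by the arbitrary
functions (Lagrangian multipliers) $(a_0 - 1)$, $a_1$, and $a_2$:

$$(a_0 - 1) \sum_i dp_i = 0 \tag{A6}$$

$$a_1 \sum_i f_i\, dp_i = 0 \tag{A7}$$

and

$$a_2 \sum_i g_i\, dp_i = 0 \tag{A8}$$

Add Equations (A5), (A6), (A7), and (A8), and collect terms:

$$\sum_i (\ln p_i + a_0 + a_1f_i + a_2g_i)\, dp_i = 0 \tag{A9}$$

To guarantee that Equation (A9) will be satisfied, regardless of the
variations $dp_i$, the parenthesis is equated to zero:

$$\ln p_i = -a_0 - a_1f_i - a_2g_i - \dots \tag{A10}$$

or

$$p_i = e^{-a_0 - a_1f_i - a_2g_i - \dots} \tag{A11}$$

Substituting Equation (A11) in (A1) we have:

$$\sum_i e^{-a_0 - a_1f_i - a_2g_i - \dots} = 1$$

or

$$e^{-a_0} \sum_i e^{-a_1f_i - a_2g_i - \dots} = 1$$

Hence

$$e^{a_0} = \sum_i e^{-a_1f_i - a_2g_i - \dots} \tag{A12}$$

and

$$a_0 = \ln \sum_i e^{-a_1f_i - a_2g_i - \dots} \tag{A13}$$

If we differentiate Equation (A13) with respect to $a_1$, we have:

$$\frac{\partial a_0}{\partial a_1} = -\frac{\sum_i f_i\, e^{-a_1f_i - a_2g_i - \dots}}{\sum_i e^{-a_1f_i - a_2g_i - \dots}} \tag{A14}$$

Using Equation (A12) we have:

$$\frac{\partial a_0}{\partial a_1} = -e^{-a_0} \sum_i f_i\, e^{-a_1f_i - a_2g_i - \dots} \tag{A15}$$

$$= -\sum_i f_i\, e^{-a_0 - a_1f_i - a_2g_i - \dots} \tag{A16}$$

$$= -\sum_i f_i\, p_i \tag{A17}$$

$$= -\langle f \rangle \tag{A18}$$

In similar fashion we find:

$$\frac{\partial a_0}{\partial a_2} = -\langle g \rangle$$

If we differentiate Equation (A17) with respect to $a_1$ we find:

$$\frac{\partial^2 a_0}{\partial a_1^2} = -\sum_i f_i\, \frac{\partial p_i}{\partial a_1}$$

From Equation (A11) we find:

$$\frac{\partial p_i}{\partial a_1} = e^{-a_0 - a_1f_i - a_2g_i} \left[-\frac{\partial a_0}{\partial a_1} - f_i\right] = p_i\left[\langle f \rangle - f_i\right]$$

Therefore,

$$-\frac{\partial^2 a_0}{\partial a_1^2} = \sum_i \langle f \rangle f_i\, p_i - \sum_i p_i f_i^2$$

and

$$\frac{\partial^2 a_0}{\partial a_1^2} = \langle f^2 \rangle - \langle f \rangle^2$$

Since the variance is defined by the equation

$$\sigma^2(f) = \sum_i p_i (f_i - \langle f \rangle)^2 = \sum_i p_i (f_i^2 - 2\langle f \rangle f_i + \langle f \rangle^2) = \sum_i p_i f_i^2 - 2\langle f \rangle \sum_i p_i f_i + \langle f \rangle^2 \sum_i p_i$$

we find:

$$\sigma^2(f) = \langle f^2 \rangle - \langle f \rangle^2 \tag{A19}$$

Hence,

$$\frac{\partial^2 a_0}{\partial a_1^2} = \sigma^2(f) \tag{A20}$$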
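The derivative relations of this appendix can be checked by finite differences on a small discrete example. The sketch below is only an illustration: the values of $f_i$, $g_i$, $a_1$, and $a_2$ are arbitrary assumptions, and with $p_i = e^{-a_0 - a_1f_i - a_2g_i}$ the first derivative of $a_0$ with respect to $a_1$ is expected to equal minus the mean of f, and the second derivative its variance.

```python
import math

def log_partition(a1, a2, f, g):
    """a0 = ln sum_i exp(-a1*f_i - a2*g_i), as in Equation (A13)."""
    return math.log(sum(math.exp(-a1 * fi - a2 * gi) for fi, gi in zip(f, g)))

def probabilities(a1, a2, f, g):
    """p_i = exp(-a0 - a1*f_i - a2*g_i), as in Equation (A11)."""
    a0 = log_partition(a1, a2, f, g)
    return [math.exp(-a0 - a1 * fi - a2 * gi) for fi, gi in zip(f, g)]

f = [0.0, 1.0, 2.0, 3.0, 4.0]     # illustrative values of f(X_i)
g = [fi * fi for fi in f]         # illustrative values of g(X_i)
a1, a2, h = 0.3, 0.1, 1e-3        # multipliers and finite-difference step

p = probabilities(a1, a2, f, g)
mean_f = sum(pi * fi for pi, fi in zip(p, f))
var_f = sum(pi * (fi - mean_f) ** 2 for pi, fi in zip(p, f))

up = log_partition(a1 + h, a2, f, g)
mid = log_partition(a1, a2, f, g)
down = log_partition(a1 - h, a2, f, g)
first_derivative = (up - down) / (2.0 * h)            # central difference
second_derivative = (up - 2.0 * mid + down) / (h * h)

print(first_derivative, -mean_f)
print(second_derivative, var_f)
```

Because the second derivative is a variance it cannot be negative, so $a_0$ is a convex function of $a_1$.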
KAI LAI CHUNG


The Ergodic Theorem
of Information Theory


Information  theory  has  been  called  a  one-theorem  theory,  the  coding 
theorem  of  Shannon  without  doubt  forming  the  climax  of  the  theory.  In  the 
proof  of  that  theorem,  however,  an  important  step  is  represented  by  Mc- 
Millan's theorem  [6].  This  result  is  of  a  more  general  character;  indeed, 
it  belongs  to  the  category  of  ergodic  theory,  although  it  is  of  a  rather 
peculiar  form. 

Let us state this theorem as given by McMillan. Suppose that
$\{X_n, -\infty < n < \infty\}$ is a metrically transitive stationary sequence (an
ergodic source), where each $X_n$ takes a value from a finite set (alphabet)
$\{a_i, 1 \le i \le r\}$. Let

$$p(a_{i_0}, \dots, a_{i_{n-1}}) = P\{X_k = a_{i_k},\ 0 \le k \le n-1\} \tag{1}$$

Because of stationarity, the left member in Equation (1) is also the
probability that any n consecutive $X_k$'s (signals) take the values
$\{a_{i_0}, \dots, a_{i_{n-1}}\}$ in the order given. Shannon had the good idea of
considering $p(X_0, \dots, X_{n-1})$, where the random variables $X_0, \dots, X_{n-1}$ of
the process have been substituted for the fixed values $a_{i_0}, \dots, a_{i_{n-1}}$,
and the result $p(X_0, \dots, X_{n-1})$ is, of course, a random variable.
McMillan proved that there exists a constant H such that

$$\lim_{n \to \infty} \frac{1}{n} \log p(X_0, \dots, X_{n-1}) = -H \tag{2}$$

where the limit is taken in the sense of $L_1$-convergence (convergence in
mean of order one). The constant H is simply the corresponding limit of
the mathematical expectations:

$$H = \lim_{n \to \infty} \frac{1}{n}\, \mathcal{E}\{-\log p(X_0, \dots, X_{n-1})\} \tag{3}$$

Since the mathematical expectation occurring on the right side of
Equation (3) is the entropy of the partial source $\{X_0, \dots, X_{n-1}\}$ of n
consecutive signals, H is the "asymptotic entropy per signal" of the
original source.

If we write, momentarily, $h_n$ for $n^{-1} \log p(X_0, \dots, X_{n-1})$, McMillan's
result can be written more succinctly as

$$\lim_{n \to \infty} \mathcal{E}\{|h_n - \mathcal{E}(h_n)|\} = 0 \tag{4}$$

This asserts that the observed quantity $h_n$ well approximates its
theoretical average $\mathcal{E}(h_n)$ in a certain sense. Such an assertion is always
desirable in probability. The best way to appreciate it is to consider a
"trivial" special case. Suppose the $X_n$'s are independent and identically
distributed. Then they surely form a stationary sequence; the zero-or-one
law implies that it is metrically transitive.
But in this case we have

$$p(X_0, \dots, X_{n-1}) = p(X_0) \cdots p(X_{n-1}) \tag{5}$$

Since the right member is a continued product, it is natural to take
logarithms, thus:

$$\frac{1}{n} \log p(X_0, \dots, X_{n-1}) = \frac{1}{n} \sum_{k=0}^{n-1} \log p(X_k)$$

The sequence of random variables $\{\log p(X_k),\ 0 \le k < \infty\}$ is again
independent and identically distributed, and since the alphabet is finite,

$$\mathcal{E}\{\log p(X_k)\} = \sum_{i=1}^{r} p(a_i) \log p(a_i) = -H_0 \tag{6}$$

exists, where $H_0$ is the entropy of the common distribution of the $X_k$'s.
Thus the classical law of large numbers asserts that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \log p(X_k) = -H_0 \tag{7}$$

where, according to Bernoulli, the limit can be taken in the sense of
convergence in probability and, according to Borel, in the sense of almost
sure convergence (convergence with probability one). As an afterthought,
it also holds in the sense of $L_1$-convergence, since the left member of
Equation (7) is bounded in n. Thus McMillan's theorem in this particular
case reduces to one of the cornerstones of probability theory.
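The independent, identically distributed special case is easy to observe by simulation. The sketch below is only an illustration of the law of large numbers for $\log p(X_k)$: the four-letter alphabet and its probabilities are arbitrary assumptions, a fixed random seed is used for reproducibility, and logarithms are natural.

```python
import math
import random

def sample_entropy_rate(probs, n, rng):
    """Return -(1/n) * log p(X_0, ..., X_{n-1}) for n independent draws
    from the distribution `probs` over the symbols 0, 1, ..., r-1."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n)
    # For independent signals the log of the joint probability is a sum.
    return -sum(math.log(probs[s]) for s in draws) / n

probs = [0.5, 0.25, 0.125, 0.125]              # illustrative alphabet distribution
H0 = -sum(p * math.log(p) for p in probs)      # entropy of one signal, in nats
rng = random.Random(0)                         # fixed seed for reproducibility

estimates = {n: sample_entropy_rate(probs, n, rng)
             for n in (100, 10_000, 200_000)}
for n, value in estimates.items():
    print(n, value, H0)
```

As n grows, the observed value along the single simulated sequence settles near $H_0$, which is the content of the theorem in this special case.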
It turns out that in the application to the coding theorem, McMillan's
theorem is not needed in its full strength. Convergence in probability is
sufficient, and that is implied by either $L_1$-convergence or almost sure
convergence. This is a usual and welcome experience in mathematics:
we often prove more than we need. To the most practical minds this may
represent a needless waste, but then a large part of human civilization is
necessarily a waste. To others, the possibility of doing more than is
immediately necessary offers not only a challenge to perfectibility, but
also a promise of future resources. For only in sharpening our tools and
improving our products can we hope to strengthen ourselves for the next
advance forward.

Breiman [1] was the first to prove the almost sure convergence in
Equation (2). His paper contained an error which was corrected by himself
in [2]. We know that in general almost sure convergence and
$L_1$-convergence are not comparable, so that his result complements
McMillan's without including it.

But why insist on another kind of convergence? Apart from the cosmic
reasons just mentioned, there is a specific mathematical reason. In many
ways the concept of almost sure convergence is the cleanest, for it says
that, apart from a set of sample sequences of total probability zero
(which we can throw away at the outset), there is plain simple numerical
convergence as taught in any text on infinite sequences, without any
further references to measure or integration. Explicitly, if the sample
point $w_0$ does not belong to the discarded set, then

$$\frac{1}{n} \log p[X_0(w_0), \dots, X_{n-1}(w_0)] \tag{8}$$

as a numerical sequence actually converges to the constant $-H$ without
any indirection. Formally, we have simplified Equation (4) to

$$\lim_{n \to \infty} |h_n - \mathcal{E}(h_n)| = 0$$

provided we add "almost surely" at the end. Algebraic manipulations on
such sequences as Equation (8) are just those with numbers. For
example, suppose $N_n$ is a random variable such that

$$\lim_{n \to \infty} \frac{N_n}{n} = 1$$

almost surely; $N_n$ may depend heavily on the sequence $\{X_n, -\infty < n < \infty\}$
in any old way. Yet from the almost sure convergence in Equation (2)
follows without any ado that of

$$\frac{1}{N_n} \log p(X_0, \dots, X_{n-1})$$

to the same limit. The corresponding assertion cannot be made for
convergence in probability or $L_1$-convergence.¹

I  have  dwelt  on  the  foregoing  comments  because  of  prevalent  defensive 
attitudes  among  some  applied  mathematicians.  To  cite  one  instance,  van 
der  Waerden,  a  mathematician  turned  statistician,  expressed  the  view  [8] 
that  only  the  weak  law  of  large  numbers,  but  not  the  strong  one,  plays  a 
role  in  statistics.  It  is  interesting  that  as  good  a  pure  mathematician  as 
van  der  Waerden,  when  he  turned  to  applications,  would  abandon  his 
idealistic,  perfectionist  approach  and  switch  to  a  parochial  attitude 
toward  utility  or  relevance.  An  opinion  exactly  opposite  to  his  can 
be  readily  propounded  holding  that  in  reality  only  the  strong  law,  dealing 
with  a  clean  concept  as  explained  above,  has  a  truly  intuitive  meaning, 
while  such  other  notions  as  weak  convergence  are  only  make-do  half- 
measures  which  one  adopts  only  when  one  does  not  know  better!  I  quote 
from  my  review  [9]  of  the  cited  book: 

    The reviewer takes exception to the remark on p. 98 that the
    strong law of large numbers "scarcely plays a role in
    mathematical statistics." This is like saying e.g. that Dedekind cut
    scarcely plays a role in numerical analysis (or dynamics). The
    point is, even if the strong law is meaningless in a final
    statistical statement, it may well enter into an argument or proof which
    is essential to the statistical conclusion, just as the real
    number system is surely at the back of many calculations although
    the IBM machines yield nothing but terminating decimals.²

We now come to the dormant assumption in McMillan's theorem: the
alphabet is finite. This is such a common hypothesis in information
theory for a discrete channel that the restriction has rarely been
questioned. In practice, of course, any alphabet is finite. Indeed, the whole
observable universe is finite and will remain so. Recalling that a rational
number corresponds to a terminating or repeating, hence essentially
finite, decimal, while an irrational number represents an infinite decimal,
the finiteness of alphabets results from the fact that all human operations
must terminate, so that all recorded messages (say the digits of a
decimal) must be finite. Just as the realities of computing machines
should not bar our interest in real numbers, as hinted in my quotation
above, so the realities of signaling machines should not prevent a
consideration of an infinite alphabet. After all, a countable set contains
finite subsets of any size, but no finite set, however arbitrary or large,
can begin to "approximate" a countable one. The last sentence is not
meant as an enunciated truism; it is aimed at those deluded pragmatists
(of whom the number is still amazingly large) who think that an infinite
set is just an "arbitrarily large finite set!"

¹This is not an artificial example; see pages 91 and 93 of my book [5] for a
true experience.

²In deference to the present audience, may I add Burroughs, Sperry Rand, and
other machine manufacturers?

Without  flying  off  along  these  tangents,  let  us  examine  just  one  anal- 
ogous situation.  The  theory  of  Markov  chains  was  also  originally  invented 
for  a  finite  number  of  states  -  in  fact,  exactly  two.  It  was  not  until  much 
later,  in  1936,  that  Kolmogorov  seriously  launched  the  countable  theory. 
Although  the  majority  of  applications  today  are  still  concerned  with  the 
finite  case,  there  does  not  seem  to  be  any  dispute  whether  the  countable 
theory  is  useful  or  relevant.  The  truth,  mathematically  speaking,  is  that 
the  countable  theory  is  much  more  elegant  and  deeper.  It  offers  a  good 
deal  that  is  new  in  conception  and  methodology.  It  is  an  instance  of 
generalization  which  proves  to  be  richly  rewarding. 

I  do  not  know  if  the  same  development  will  occur  in  information  theory. 
But  there  is  no  question  that  if  a  result  extends  from  a  finite  to  an  in- 
finite alphabet,  without  compensatory  loss  of  generality  or  precision, 
this  is  a  gain.  Such  an  extension  of  McMillan's  theorem  was  given  by 
Carleson [3], who proved that Equation (2) holds in the sense of $L_1$-convergence for a countable alphabet $\{a_i,\ 1 \le i < \infty\}$, provided the now-infinite series corresponding to Equation (6) is convergent:

$$H_0 = -\sum_{i=1}^{\infty} p(a_i) \log p(a_i) < \infty \tag{9}$$
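As a numerical illustration of condition (9), the following minimal sketch evaluates partial sums of the entropy series for a geometric distribution on a countable alphabet. The distribution, the function names, and the choice of natural logarithms are assumptions made for this example only; they do not come from Carleson's paper.

```python
import math

def entropy_partial_sum(p, n_terms):
    """Partial sum -sum_{i=1}^{n} p(i) log p(i) of the series in condition (9)."""
    total = 0.0
    for i in range(1, n_terms + 1):
        pi = p(i)
        if pi > 0.0:               # the term 0 * log 0 is taken to be 0
            total -= pi * math.log(pi)
    return total

def geometric(q):
    """Hypothetical letter probabilities p(a_i) = (1 - q) * q**(i - 1), i = 1, 2, ..."""
    return lambda i: (1.0 - q) * q ** (i - 1)

def geometric_entropy(q):
    """Closed form of the full series for the geometric distribution."""
    return -((1.0 - q) * math.log(1.0 - q) + q * math.log(q)) / (1.0 - q)

if __name__ == "__main__":
    for n in (5, 20, 200):
        print(n, entropy_partial_sum(geometric(0.5), n))
    print("limit", geometric_entropy(0.5))
```

For this particular distribution the partial sums settle quickly, so the alphabet is infinite while $H_0$ remains finite, which is the situation the condition is meant to cover.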


This  generalization  is  not  trivial,  since  the  original  proof  cannot  be 
carried  over  without  an  additional  argument.  A  valid  criterion  for  gen- 
eralization, frequently  forgotten  by  authors,  is  that  it  should  add  some- 
thing new  and  not  merely  correct  the  oversight  of  a  previous  author. 
Admittedly,  sometimes  it  is  difficult  to  draw  a  fine  line  between  what  is 
really  new  and  what  is  just  overlooked  or  willfully  ignored,  but,  as  in 
all  human  judgments,  honorable  intention  and  common  sense  must  be  the 
ultimate  guide. 

Having seen both Breiman's and Carleson's papers, I felt there was still one more thing to be done, namely, combining their efforts and proving that Equation (2) holds in the sense of almost sure convergence for a countable alphabet satisfying Equation (9). This I was able to do, as reported elsewhere [4]. There is no point in going into the details (which are not difficult) but I will sketch the outline and allow myself some further comments.

Some  people  say  that  if  we  know  a  result  is  correct,  then  let  us  simply 
take  note  and  possibly  try  to  use  the  result.  Why  bother  with  a  proof 
(which is such a pain to follow)? This cookbook approach to mathematics is quite justified in many situations. Many of us drive an automobile,
but,  like  myself,  have  not  the  faintest  idea  of  its  mechanism.  But  when 
we  mean  to  do  research,  or  just  some  serious  thinking,  then  one  cannot 
really  understand  a  theorem  unless  one  has  some  idea  of  its  proof.  Ergo. 

The first step in our proof may be said to be an imitation of the special case considered in Equation (5). Although in general $p(X_0, \ldots, X_{n-1})$ cannot be split into simple factors, we insist on splitting it as follows:

$$p(X_0, \ldots, X_{n-1}) = \frac{p(X_0)}{1} \cdot \frac{p(X_0, X_1)}{p(X_0)} \cdots \frac{p(X_0, \ldots, X_{n-1})}{p(X_0, \ldots, X_{n-2})} \tag{10}$$

$$\frac{1}{n} \log p(X_0, \ldots, X_{n-1}) = \frac{1}{n} \sum_{k=0}^{n-1} \log \frac{p(X_0, \ldots, X_k)}{p(X_0, \ldots, X_{k-1})} \tag{11}$$

where, for $k = 0$, $p(X_0, \ldots, X_{k-1}) = 1$. Let us put

$$\mathfrak{X} = (\ldots, X_{-1}, X_0, X_1, \ldots)$$

$$f_k(\mathfrak{X}) = -\log \frac{p(X_0, \ldots, X_k)}{p(X_0, \ldots, X_{k-1})}$$

$$g_k(\mathfrak{X}) = -\log \frac{p(X_{-k}, \ldots, X_0)}{p(X_{-k}, \ldots, X_{-1})}$$

If $T$ is the shift operator which shifts each $X_n$ into $X_{n+1}$, then

$$f_k(\mathfrak{X}) = g_k(T^k \mathfrak{X})$$

thus

$$-\frac{1}{n} \log p(X_0, \ldots, X_{n-1}) = \frac{1}{n} \sum_{k=0}^{n-1} g_k(T^k \mathfrak{X}) \tag{12}$$

Now by definition if $X_0 = a$, then $g_k(\mathfrak{X})$ becomes

$$-\log P\{X_0 = a \mid X_{-1}, \ldots, X_{-k}\}$$

The sequence $\{P\{X_0 = a \mid X_{-1}, \ldots, X_{-k}\},\ 1 \le k < \infty\}$ is that of the conditional probabilities of the event $X_0 = a$, given the $k$ preceding variables. According to a famous theorem of P. Lévy, the precursor of the martingale convergence theorem, it tends almost surely to a limit as $k \to \infty$. Hence the sequence $\{g_k,\ 1 \le k < \infty\}$ also converges to a limit random variable $g$. An easy calculation shows that

$$\mathcal{E}(g_{k+1}) \le \mathcal{E}(g_k) \le H_0 < \infty \tag{13}$$

Hence $\lim_{k \to \infty} \mathcal{E}(g_k)$ exists, and so by Equation (11) the limit in Equation (3) also exists. It will turn out that this limit is equal to $\mathcal{E}(g)$.
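The splitting into conditional factors can be tried out numerically. The sketch below assumes a stationary two-state Markov chain (the transition matrix `P`, the stationary vector `PI`, and all function names are hypothetical choices for illustration, not taken from the paper). For such a chain each factor in the splitting reduces to a one-step transition probability, so the normalized negative log-probability of a sample path is an average of such terms, and it can be compared with the entropy rate.

```python
import math
import random

# Hypothetical two-state chain; PI is its stationary distribution (PI @ P == PI).
P = [[0.9, 0.1], [0.4, 0.6]]
PI = [0.8, 0.2]

def sample_path(n, rng):
    """Draw X_0, ..., X_{n-1} from the stationary chain."""
    x = [0 if rng.random() < PI[0] else 1]
    for _ in range(n - 1):
        x.append(0 if rng.random() < P[x[-1]][0] else 1)
    return x

def neg_log_prob_rate(x):
    """-(1/n) log p(X_0, ..., X_{n-1}), accumulated factor by factor."""
    total = -math.log(PI[x[0]])                 # factor p(X_0) / 1
    for prev, cur in zip(x, x[1:]):
        total -= math.log(P[prev][cur])         # factor p(X_0..X_k) / p(X_0..X_{k-1})
    return total / len(x)

def entropy_rate():
    """Entropy rate of the chain in natural units."""
    return -sum(PI[i] * P[i][j] * math.log(P[i][j])
                for i in range(2) for j in range(2))

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 10_000, 200_000):
        print(n, neg_log_prob_rate(sample_path(n, rng)), entropy_rate())
```

A Markov chain is of course a very special stationary process; the point of the theorem is that the same averaging behavior holds without such structure.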

It is now tempting to substitute the limit $g$ for all the $g_k$'s in Equation (12). If this is done, then the almost sure convergence of the resulting average,

$$\frac{1}{n} \sum_{k=0}^{n-1} g(T^k \mathfrak{X}),$$

is precisely the assertion of G. D. Birkhoff's ergodic theorem. In the metrically transitive case the limit will be the constant $\mathcal{E}(g)$. It remains to prove that this heuristic substitution is justified, leaving no visible effect on the outcome. Such a justification is required because the two $k$'s appearing in the right member of Equation (11) denote the same index, and there is no reason why we should tackle them one at a time. This kind of situation is common experience in mathematical analysis; the remedy is also well known: to establish some sort of uniformity in $k$. For $L_1$-convergence, it is sufficient to show that the sequence $\{g_k,\ 1 \le k < \infty\}$ is "uniformly integrable," as did McMillan and Carleson. For almost sure convergence, it is sufficient, following Breiman, to prove that

$$\mathcal{E}\Big\{ \sup_{1 \le k < \infty} g_k \Big\} < \infty \tag{14}$$

The said uniform integrability being implied by the condition (14), the proof by Breiman and the writer actually covers both cases of convergence. The proof of Equation (14) is somewhat technical and will not be repeated here.3
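To make the role of the uniformity concrete, here is a hedged sketch of how a bound of the type (14) is typically used. It is written in the notation of this section ($\mathfrak{X}$ for the doubly infinite sequence, $\mathcal{E}$ for expectation); the auxiliary quantity $G_N$ is introduced only for this illustration, and the sketch is not a substitute for the proof in [4].

```latex
% Split the average into the Birkhoff part and an error part:
\frac{1}{n}\sum_{k=0}^{n-1} g_k(T^k\mathfrak{X})
  = \frac{1}{n}\sum_{k=0}^{n-1} g(T^k\mathfrak{X})
  + \frac{1}{n}\sum_{k=0}^{n-1} (g_k - g)(T^k\mathfrak{X}).

% For fixed N put G_N = \sup_{k \ge N} |g_k - g|.  Since g_k \ge 0,
% G_N \le 2 \sup_k g_k, which is integrable by (14).  For k \ge N,
% |(g_k - g)(T^k\mathfrak{X})| \le G_N(T^k\mathfrak{X}), so the ergodic
% theorem applied to G_N gives, in the metrically transitive case,
\limsup_{n\to\infty}
  \left| \frac{1}{n}\sum_{k=0}^{n-1} (g_k - g)(T^k\mathfrak{X}) \right|
  \le \mathcal{E}(G_N) \quad \text{almost surely}.

% Finally G_N \to 0 almost surely as N \to \infty, and dominated
% convergence yields \mathcal{E}(G_N) \to 0, so the error part vanishes.
```

The first $N$ terms of the error sum are finite almost surely and are divided by $n$, so they do not affect the limit.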

One  more  point  remains  to  be  made.  As  sketched  above,  the  ergodic 
theorem  of  information  theory  derives  from  two  of  the  most  famous 
theorems  in  probability,  namely,  the  individual  ergodic  theorem  and  the 
martingale  convergence  theorem.  Apart  from  its  meaning  for  information 
theory,  such  a  hybrid  is  usually  interesting  in  mathematics. 

This opportunity is taken to correct two errors in [4]: first, the hypothesis "metrically transitive" was inadvertently left out; second, the definition of H in formula (2) was incorrectly given as the $H_0$ in Equation (7) here.


3 Subsequently, Mrs. Moy has found a neater proof based on a different idea [7].


Exploratory Mathematics by Machine


This  paper  might  better  have  been  titled,  "A  Small  Step  Toward  Ex- 
ploratory Mathematics  by  Machine."  Specifically,  we  present  a  way  for 
testing  the  validity  of  an  expanding  truth-functional  formula  in  alter- 
national  normal  form.  The  relevance  of  such  a  test  (or  of  the  related 
consistency  test  for  a  formula  in  conjunctional  normal  form)  to  a  more 
general  theorem-testing  program  for  logical  and  mathematical  expressions 
has  been  extensively  argued  elsewhere  [1,2,3].  Nevertheless,  we  present 
a  few  general  remarks  for  the  benefit  of  the  reader  unacquainted  with 
the  problem. 

It  is  well  known  that  elementary  mathematical  formulas  can  be  recon- 
structed within  the  language  of  symbolic  logic.  Such  a  transformation  is 
not  accomplished,  however,  without  considerable  growth  in  the  size  of 
the  formula.  It  is  thus  theoretically  possible  to  investigate  certain 
mathematical  questions  indirectly  by  testing  purely  logical  formulas, 
though  of  a  much  larger  size.  The  extent  to  which  indirect  investigations 
such  as  this  can  have  true  mathematical  interest  is  not  clear  at  the 
present  time.  Nevertheless,  the  development  of  large-scale  computing 
machines  gives  added  hope  that,  in  the  future,  problems  whose  chief 
difficulty  is  sheer  size  can  be  managed  more  successfully.  Hence,  if 
effective  techniques  can  be  developed  for  bringing  the  full  power  of  the 
computer  to  bear  on  purely  logical  problems,  it  is  conceivable  that  some 
nontrivial  mathematical  questions  can,  in  turn,  be  resolved.  Many  prob- 
lems of  interest  primarily  to  the  logician  might  also  be  solved.  Suppose, 
however, computer techniques for managing purely logical problems prove insufficient, taken by themselves, in the realm of mathematics. Such
techniques  might  still  constitute  a  necessary  part  of  the  more  expanded 
methods  by  which,  hopefully,  a  computer  might  test  significant  mathe- 
matical formulas  for  "theoremhood."  Hence,  we  have  set  ourselves  the 
problem  of  developing  efficient  computer  programs  for  testing  nontrivial 
logical  expressions. 

Quantification  theory,  sometimes  called  the  first-order  predicate  cal- 
culus, is  perhaps  the  cornerstone  within  logic.  Here  a  set  of  formulas  is 
obtained  from  two  basic  kinds  of  logical  particle:  truth-functional  and 
quantificational.  Truth-functional  expressions  are  composed  of  logical 
elements  such  as  AND,  OR,  NOT,  NEITHER-NOR.  Quantificational  ex- 
pressions combine  truth-functional  particles  with  two  additional  logical 
operators  —  namely,  the  existential  and  universal  quantifiers.  The  ex- 
istential quantifier  with  respect  to  a  variable  x  may  be  roughly  translated 
by  the  phrase,  "There  is  an  x  such  that...";  and  the  universal  quantifier, 
by  the  phrase,  "For  whatever  x  taken...."  Quantifiers  govern  fragmentary 
formulas  in  which  the  variables  they  have  specified  reappear  as  parts  of 
predicates  which  are,  in  turn,  brought  together  truth  functionally. 

Generally  speaking,  quantificational  languages  without  certain  special 
restrictions  are  formally  undecidable.  This  means  that,  if  we  are  pre- 
sented with  a  formula  of  the  language,  we  have  no  finite  procedure  which 
will  tell  us  in  every  case  whether  or  not  the  formula  in  question  is  valid 
within  the  system.  This  does  not  mean,  of  course,  that  we  cannot  prove 
validity or nonvalidity in many cases. There is a fundamental result due
to  Herbrand  which  enables,  in  large  part,  the  validity  test  of  a  quanti- 
ficational formula  to  be  reduced  to  one  for  a  truth-functional  expression 
only.  Specifically,  he  has  shown  that  we  may  obtain  from  any  quantifica- 
tional formula  an  expanding  truth-functional  expression  which  at  some 
point  in  the  expansion,  usually  unknown,  will  become  valid  if  the  quanti- 
ficational formula  is  itself  valid  (see,  for  example,  [4]).  Hence,  one 
method  for  testing  a  quantificational  formula  is  to  examine  the  associated 
truth-functional  expansion.  Since  the  latter,  even  when  simplified,  be- 
comes quite  large  by  comparison  with  the  quantificational  formulas,  which 
are,  in  turn,  quite  large  by  comparison  with  the  mathematical  expressions 
they  may  represent,  it  is  important  that  computer  techniques  be  avail- 
able for  handling  truly  vast  truth-functional  expressions.  It  is  toward 
the  latter  objective  that  we  are  working  in  this  paper,  not  only  so  as  to 
have  the  techniques,  but  also  as  an  experiment  in  how  far  we  can  go 
along  these  lines. 

As  noted  above,  the  particular  problem  we  have  attacked  is  that  of 
testing the validity of an expanding truth-functional formula in alternational normal form. An expression is in the latter state if denials are restricted to the individual truth-functional variables, no connectives other than AND and OR are used, and the connective AND never joins
subexpressions  containing  OR.  It  is  possible  so  to  set  up  the  transforma- 
tion of  a  quantificational  formula  into  an  expanding  truth-functional  one 
that  the  latter  will  be  continually  presented  in  alternational  normal  form. 
As  we  shall  see,  we  can  then  take  good  advantage  of  the  normality  of 
the  expression. 
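For readers who want something concrete to hold on to, the following sketch shows one possible machine representation of a formula in alternational normal form, together with a brute-force validity check to use as a reference. The representation (a list of clauses, each a mapping from variable name to `True` for an undenied or `False` for a denied occurrence) and the function names are assumptions of this example; the exhaustive check is not the method of this paper, which is designed precisely to avoid enumerating all assignments.

```python
from itertools import product

def clause_true(clause, assignment):
    """A clause (a conjunction of literals) holds if every literal agrees."""
    return all(assignment[v] == sign for v, sign in clause.items())

def is_valid_brute_force(clauses):
    """An alternation of clauses is valid if some clause holds under every assignment."""
    variables = sorted({v for c in clauses for v in c})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if not any(clause_true(c, assignment) for c in clauses):
            return False          # found a falsifying assignment
    return True

if __name__ == "__main__":
    # p  OR  (NOT p AND q)  OR  (NOT p AND NOT q)
    valid_formula = [{"p": True}, {"p": False, "q": True}, {"p": False, "q": False}]
    # (p AND q)  OR  NOT p   (fails when p is true and q is false)
    nonvalid_formula = [{"p": True, "q": True}, {"p": False}]
    print(is_valid_brute_force(valid_formula), is_valid_brute_force(nonvalid_formula))
```

The cost of this check doubles with every additional variable, which is why it is useful only as a correctness reference on small cases.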


THE  BASIC  METHOD 

The  basic  method  we  propose  is  similar  to  the  one  set  forth  in  our 
earlier  paper  with  Sward,  in  that  the  fundamental  technique  is  one  of 
variable elimination [5]. Advantage is also taken of simplifications resulting from variables in partial state. As noted by Putnam and Davis,
simplifications  also  follow  from  single  literal  clauses  [3,  p.  209].  Pro- 
gramming of  the  method  on  the  IBM  7090  is  not  yet  completed;  but,  be- 
cause of  the  deadline  imposed  by  the  editors,  we  offer  this,  perhaps 
premature,  account  of  it. 

The  technique  set  forth  by  Putnam  and  Davis,1  although  formidable,  to 
our  mind  suffers  certain  disadvantages.  First  of  all,  the  "consolidation" 
which  takes  place  after  each  variable  elimination  will  delay  the  un- 
masking of  a  nonvalid  formula  [3,  pp.  210-211].  Second,  after  each  ad- 
dition of  new  clauses  by  the  Herbrand  expansion,  or  otherwise,  the  total 
expression must be tested from "scratch" [3, pp. 211-212]. Third, the
computer  is  asked  to  carry  out  a  variety  of  truth-functional  operations 
which  are  probably  "slow"  computerwise. 

We  have  found  it  difficult  to  avoid  the  above  limitations  without  some 
loss  elsewhere.  We  hope,  however,  to  have  gained  by  the  transaction. 
Specifically,  we  eliminate  variables  throughout  the  entire  solution  in  a 
specified  order.  This  means  that  simplifications  resulting  from  variables 
in  partial  state  or  from  single  literal  clauses  must  wait  until  the  variable 
in  question  is  eliminated. 

We  do  not  anticipate  undue  space  problems.  Because  of  the  generally 
systematic  way  in  which  the  problem  is  handled,  we  believe  the  neces- 
sary use  of  tapes  can  be  made  on  large  problems  without  great  loss  of 
time.  One  final  comment  might  be  made  before  we  present  a  rough  state- 
ment of  the  method.  Our  planned  approach  is  somewhat  empirical.  Certain 
questions  are  left  open  —  for  example,  how  the  variables  will  be  ordered 
initially, or how many clauses will be added between trials. These questions can best be resolved by trying out the program on proposed problems this way and that.

1 The problem they actually consider is that of an expression in conjunctional normal form. We have recast the argument to be applicable to the alternational case.

Suppose  we  are  presented  with  the  following  matrix: 


         a1    a2    a3    a4

   1      1     X     0     X
   2      X     0    [1]    X
   3      X     X     X     X
   4      0     1     0     1
   5      X     1     X     X

The  horizontal  rows  represent  the  clauses  of  the  alternational  expression. 
The  columns  represent  in  an  obvious  manner  which  variables,  denied  or 
undenied,  occur  in  which  clauses.  The  brackets  around  the  positive 
occurrence of a3 in clause 2 indicate a terminal occurrence, in that no
variable  occurs  therein  thereafter.  The  computer,  in  effect,  passes  through 
an  abbreviated  version  of  this  matrix  from  top  to  bottom,  column  by 
column.  Sometimes,  as  we  shall  see,  it  must  return  to  a  point  already 
passed,  but  in  a  different  connection. 

Suppose our original matrix had 50 clauses. We use the information in the column under a1 to channel clauses as we eliminate a1. Consider the following:

               [1, 2, 3, 4, 5, ..., 50]
                 /                  \
     [1, 2, 3, 5, ...]        [2, 3, 4, 5, ...]


We  may  conveniently  call  the  bracketed  collections  of  clauses  traps. 
Clause  1  does  not  end  up  in  the  lower  right  trap,  nor  clause  4  in  the  left. 
In point of fact, clause 1 is different by the absence of a1 in the lower
trap  from  the  upper;  but  this  change  need  not  be  recorded.  We  still 
possess  all  the  information  necessary  to  detect  demonstrably  valid  or 
nonvalid  traps.  A  nonvalid  trap  is  one  which  is  empty,  without  clauses. 
A valid trap results from the elimination of a terminal variable. For example, suppose we were eliminating a3 from the following trap:

     [1, 2, ...]

We would normally obtain the two resultant traps [2, ...] and [1, ...], but the former of these we know to be valid, since a3 is the terminal variable for clause 2, and in turn, as we shall say, governs the trap in question. It should be noted that clauses are transmitted to a lower trap either actively or passively. The latter occurs when the
eliminated  variable  is  absent  from  the  clause.  By  detecting  cases  in 
which  no  clause  is  transmitted  actively  to  a  given  trap,  we  can  determine 
the  occurrence  of  a  variable  in  partial  state  and  act  accordingly. 
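The channeling step just described can be sketched in a few lines. The code below assumes the five-row matrix shown earlier, stored with the entries "1" (undenied), "0" (denied), and "X" (absent); the dictionary `TERMINAL`, which records the bracketed occurrence in clause 2, and the function name `channel` are choices made for this illustration, not part of the program described in the paper.

```python
# Clause-variable matrix from the text; columns are a1, a2, a3, a4.
MATRIX = {
    1: ["1", "X", "0", "X"],
    2: ["X", "0", "1", "X"],
    3: ["X", "X", "X", "X"],
    4: ["0", "1", "0", "1"],
    5: ["X", "1", "X", "X"],
}
# Bracketed (terminal) occurrences: clause -> column index. Only the one
# shown in the text is recorded here.
TERMINAL = {2: 2}

def channel(trap, col):
    """Split a trap on the variable in column `col`.

    Returns the two lower traps together with flags saying whether each
    received a clause actively and whether it is governed by a terminal
    variable (and hence known to be valid).
    """
    result = {
        "pos": [], "neg": [],
        "pos_active": False, "neg_active": False,
        "pos_governed": False, "neg_governed": False,
    }
    for clause in trap:
        entry = MATRIX[clause][col]
        is_terminal = TERMINAL.get(clause) == col
        if entry in ("1", "X"):
            result["pos"].append(clause)      # passive when entry is "X"
        if entry in ("0", "X"):
            result["neg"].append(clause)
        if entry == "1":
            result["pos_active"] = True
            result["pos_governed"] = result["pos_governed"] or is_terminal
        if entry == "0":
            result["neg_active"] = True
            result["neg_governed"] = result["neg_governed"] or is_terminal
    return result

if __name__ == "__main__":
    print(channel([1, 2, 3, 4, 5], 0))   # eliminating a1
    print(channel([1, 2], 2))            # eliminating a3 from the trap [1, 2]
    print(channel([2, 3, 5], 0))         # no active transmission: a1 in partial state
```

Only clause numbers move between traps; the literals themselves are never copied, which is the sense in which the change to clause 1 "need not be recorded."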

Before  continuing  further,  we  note  that  such  operations  as  the  chan- 
neling of  numbers  are  more  speedily  handled  on  the  typical  computer  than 
familiar  truth-functional  operations.  It  is  not  apparent  at  this  point  in  the 
discussion,  however,  how  we  shall  avoid  starting  from  "scratch"  when 
new  clauses  are  added  by  a  Herbrand  expansion,  or  otherwise.  Suppose 
the  expression  proves  nonvalid.  New  clauses  are  added.  Without  redoing 
the  work  already  done,  we  might  by  a  slightly  more  complicated  procedure 
"update"  the  problem  by  channeling  the  new  clauses  to  the  point  earlier 
reached  at  which  the  nonvalidity  appeared,  and  then  continue  in  the 
regular  mode.  The  chief  complication  is  the  fact  that  some  variables 
earlier  treated  as  being  in  partial  state  may  have  lost  that  property 
through  the  addition  of  new  clauses.  As  will  be  seen,  however,  this 
problem  is  easily  managed  by  appropriate  bookkeeping. 

One  final  observation  should  be  made  before  we  present  detailed  rules 
of  procedure.  In  addition  to  the  clause-variable  matrix,  a  record  must  be 
kept  of  the  problem  to  date.  We  must  know  not  to  explore  regions  already 
resolved,  and  which  avenues  remain  open.  To  keep  the  size  of  this  record 
small,  to  facilitate  bookkeeping,  and,  also,  to  expedite  the  determination 
of  nonvalid  formulas,  we  proceed  in  a  basically  serial  manner.  If  each 
variable  is  taken  as  determining  a  level  (beyond  which  it  does  not  occur), 
we  explore  only  one  path  at  a  time  per  level,  and  add  the  appropriate 
entry  regarding  the  other  path  to  the  record.  Such  an  entry  is  said  to  be 
held.  Thus,  at  any  one  moment,  the  record  will  have  no  more  entries  than 
the  number  of  levels  above. 
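The serial exploration with a record of held entries can be sketched as follows. This is a deliberately simplified model under stated assumptions: clauses are non-empty mappings from variable to `True` (undenied) or `False` (denied), every variable appears in `order`, an empty trap is reported at once as nonvalidity (no new clauses are fetched), and the refinements for variables in partial state and the distinction between boxes and doors are omitted. The function name and return convention are hypothetical.

```python
def serial_validity(clauses, order):
    """Return (valid, max_record_size) for an alternational formula.

    One path per level is explored; the other, if still open, is held
    on the record and revisited during ascent.
    """
    level_of = {v: i for i, v in enumerate(order)}
    # Terminal variable of each clause: its last variable in elimination order.
    terminal = [max(c, key=lambda v: level_of[v]) for c in clauses]

    record = []                                   # held entries: (level, trap)
    max_record = 0
    level, trap = 0, list(range(len(clauses)))
    if not trap:
        return False, max_record                  # empty trap: nonvalid

    while True:
        v = order[level]
        open_paths = []
        for sign in (True, False):
            # A clause reaches this path actively (v occurs with this sign)
            # or passively (v is absent from the clause).
            path = [i for i in trap if clauses[i].get(v, sign) == sign]
            if any(terminal[i] == v for i in path):
                continue                          # governed by a terminal variable: valid
            if not path:
                return False, max_record          # empty trap: nonvalid
            open_paths.append(path)

        if not open_paths:                        # both paths valid: ascend
            if not record:
                return True, max_record
            level, trap = record.pop()
        else:
            if len(open_paths) == 2:              # hold one path, explore the other
                record.append((level + 1, open_paths[1]))
                max_record = max(max_record, len(record))
            level, trap = level + 1, open_paths[0]

if __name__ == "__main__":
    formula = [{"a": True, "b": True}, {"a": True, "b": False},
               {"a": False, "c": True}, {"a": False, "c": False}]
    print(serial_validity(formula, ["a", "b", "c"]))
```

Because at most one entry is held per level and ascent always resumes from the deepest held entry, the record in this sketch never exceeds the number of levels, in line with the remark above.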


RULES  OF  PROCEDURE 

In  all,  there  are  three  modes  of  procedure:  descent,  ascent,  and  up- 
dating. Descent  is  the  normal  channeling,  or  bookkeeping,  operation  as 
we  explore  downward.  Ascent  occurs  when  the  region  under  scrutiny  has 
proved  valid  and  we  return  to  some  earlier  entry  to  be  examined.  We 
update  after  the  formula  has  proved  nonvalid  and  new  clauses  have  been 
added.  In  effect,  we  carry  these  new  clauses  through  the  paths  already 
explored.  The  entries  held  are  of  two  kinds:  boxes  and  doors.  A  box  is  a 
trap  (a  collection  of  clauses)  which  may  possibly  serve  as  the  path  of  a 
future  descent.  A  door  is  a  barrier  to  future  descent,  since  the  area 
covered  has  already  been  shown  valid. 

The rules for descent, ascent, and updating are as follows:

Descent

CASE  1.  There  are  no  terminal  variables  and  both  paths  receive 
clauses  actively  —  for  one  path  hold  a  box  one;  explore  the  other  path. 

CASE  2.  There  are  no  terminal  variables  but  only  one  path  receives 
clauses  actively  —  explore  the  path  which  receives  all  clauses  passively 
and  hold  a  box  two  for  the  other  path. 

CASE  3.  Neither  path  receives  clauses  actively  -  hold  a  box  three 
for  one  path  and  explore  the  other. 

CASE  4.  There  are  no  terminal  variables,  but  one  path  yields  an 
empty  trap  -  hold  a  box  four  for  the  nonempty  path,  terminate  descent 
and  obtain  new  clauses  (by  the  Herbrand  expansion,  or  otherwise;  up- 
dating will  follow). 

CASE  5.  One  path  is  governed  by  a  terminal  variable  —  hold  a  door 
one  for  that  path  and  explore  the  other. 

CASE  6.  Both  paths  are  governed  by  a  terminal  variable  -  terminate 
descent  and  commence  ascent. 

CASE  7.  One  path  is  governed  by  a  terminal  variable  but  the  other 
yields  an  empty  trap  -  hold  a  door  two  for  the  former  path,  terminate 
descent  and  obtain  new  clauses. 
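The seven descent cases can be read as a small decision table. The sketch below is one possible encoding: the six boolean flags describe the two lower traps produced by an elimination, and the function returns the number of the applicable case. The flag names and, in particular, the order in which overlapping conditions are tested (terminal variables first, then empty traps, then active reception) are assumptions of this example; the rules as stated do not spell out a precedence.

```python
def descent_case(*, pos_active=False, neg_active=False,
                 pos_governed=False, neg_governed=False,
                 pos_empty=False, neg_empty=False):
    """Return the number (1-7) of the descent case that applies.

    *_active:   the path receives at least one clause actively
    *_governed: the path is governed by a terminal variable (known valid)
    *_empty:    the path yields an empty trap (known nonvalid)
    """
    if pos_governed and neg_governed:
        return 6                      # terminate descent, commence ascent
    if pos_governed or neg_governed:
        other_empty = neg_empty if pos_governed else pos_empty
        return 7 if other_empty else 5
    if pos_empty or neg_empty:
        return 4                      # hold box four, obtain new clauses
    # No terminal variables and no empty trap: count the active paths.
    return {2: 1, 1: 2, 0: 3}[int(pos_active) + int(neg_active)]

if __name__ == "__main__":
    print(descent_case(pos_active=True, neg_active=True))   # both paths active
    print(descent_case())                                    # variable in partial state
    print(descent_case(neg_governed=True, pos_empty=True))   # door two, new clauses
```

A caller would combine this with the channeling step: the flags come from splitting the current trap, and the returned case number selects which entry to hold and which path to explore.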

Ascent 

We  proceed  backward  up  the  chain  of  entries  in  the  record  until  a  box 
one  is  encountered.  At  this  point,  we  hold  a  door  one,  all  the  lower 
entries  being  dropped,  and  recommence  descent  through  the  set  of 
clauses  contained  in  the  box  one.  If  no  box  one  is  encountered  during 
ascent,  the  total  expression  is  valid. 
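The ascent rule amounts to popping the record until a box one is found. A minimal sketch, assuming the record is a list of (kind, trap) pairs ordered from the top level downward; the entry kinds are the strings used in the rules, while the clause numbers in the usage example are hypothetical.

```python
def ascend(record):
    """Walk back up the record (modified in place).

    Returns the trap of the first box one met, after replacing that entry
    by a door one and dropping every entry below it. Returns None if no
    box one remains, in which case the total expression is valid.
    """
    while record:
        kind, trap = record.pop()
        if kind == "box one":
            record.append(("door one", trap))
            return trap               # recommence descent through this trap
    return None

if __name__ == "__main__":
    record = [("door one", [1]), ("box one", [3, 4, 5, 6]),
              ("door one", [2, 5, 6]), ("box one", [4, 5])]
    while (trap := ascend(record)) is not None:
        print("descend through", trap, "record is now", record)
    print("no box one left: expression valid")
```

In this sketch boxes of the other kinds are simply passed over during ascent, as the rule above prescribes; only a box one marks a path that still has to be explored.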

Updating 

In  the  updating  process,  we  complete  a  record  entry  by  adding  the 
appropriate  new  clauses.  We  continue  an  active  path  by  channeling  the 
appropriate  new  clauses  down  a  path  previously  explored.  We  see,  there- 
fore, with  the  exception  of  a  box  three,  record  entries  must  also  indicate 
which  of  the  two  possible  paths  was  held. 

CASE 1. With no terminal variables, a box one is met - complete and hold the box one and continue the active path.

CASE  2.  As  in  Case  1  (no  terminal  variables),  a  box  two  is  met  — 
complete  the  box  two;  (a)  if  there  are  no  clauses  actively  added  to  the 
active  path,  hold  the  box  two  and  continue  the  active  path;  (b)  otherwise, 
convert  the  box  two  to  a  box  one  and  hold,  continuing  the  active  path. 

CASE 3. As in Case 1, a box three is met - (a) if there is no active addition to either path, complete the box three, hold, and continue the active path; (b) if there is active addition to one path only, complete that path, hold as a box two, and continue the other path; (c) otherwise, complete and hold as a box one either of the paths and continue the other.

CASE  4.  As  in  Case  1,  a  box  four  is  met  —  complete  the  box  four; 
(a)  if  clauses  are  added  to  the  opposite  path,  but  only  passively,  convert 
the  box  four  to  a  box  two,  hold,  terminate  updating,  and  commence 
descent  of  the  completed  opposite  path;  (b)  if  clauses  are  added  actively 
to  the  opposite  path,  convert  the  box  four  to  a  box  one,  hold,  terminate 
updating,  and  commence  descent  of  the  completed  opposite  path;  (c) 
otherwise,  hold  the  completed  box  four,  terminate  updating,  and  obtain 
new  clauses. 

CASE  5.  As  in  Case  1,  a  door  one  is  met  —  hold  the  door  one  and 
continue  the  active  path. 

CASE 6. As in Case 1, a door two is met - (a) if no clauses are added
to  the  opposite  path,  hold  the  door  two,  terminate  updating,  and  obtain 
new  clauses;  (b)  otherwise,  convert  the  door  two  to  a  door  one,  hold, 
terminate  updating,  and  commence  descent  of  the  completed  opposite  path. 

CASE  7.  One  path  is  governed  by  a  terminal  variable,  a  box  one  is 
met  —  (a)  if  the  variable  governs  the  path  held,  convert  to  a  door  one, 
hold,  and  continue  the  active  path;  (b)  otherwise,  hold  a  door  one  for  the 
governed  path,  terminate  updating,  and  commence  descent  via  the  com- 
pleted path  previously  held. 

CASE  8.   As  in  Case  7,  a  box  two  is  met  —  proceed  as  in  Case  7. 

CASE  9.  As  in  Case  7,  a  box  three  is  met  —  hold  a  door  one  for  the 
governed  path,  and  continue  the  other  path  as  the  active  path. 

CASE  10.  As  in  Case  7,  a  box  four  is  met  —  (a)  if  the  variable  gov- 
erns the  path  held  and  no  clauses  are  added  to  the  opposite  path,  con- 
vert the  box  four  to  a  door  two,  hold,  terminate  updating,  and  obtain  new 
clauses;  (b)  if  the  variable  governs  the  path  held  and  clauses  are  added 
to  the  opposite  path,  convert  the  box  four  to  a  door  one,  hold,  terminate 
updating,  and  commence  descent  of  the  completed  opposite  path;  (c) 
otherwise,  proceed  as  in  (b),  Case  7. 

CASE 11. As in Case 7, a door one is met - (a) if the variable governs the path held, hold the door one and continue the active path; (b)
otherwise,  terminate  updating  and  commence  ascent. 

CASE  12.  As  in  Case  7,  a  door  two  is  met  -  (a)  if  the  variable  gov- 
erns the  path  held  and  no  clauses  are  added  to  the  opposite  path,  hold 
the  door  two,  terminate  updating,  and  obtain  new  clauses;  (b)  if  the 
variable  governs  the  path  held  and  clauses  are  added  to  the  opposite 
path,  convert  the  door  two  to  a  door  one,  hold,  terminate  updating,  and 
commence  descent  of  the  completed  opposite  path;  (c)  otherwise,  termi- 
nate updating  and  commence  ascent. 

CASE 13. Both paths are governed by a terminal variable - terminate updating and commence ascent.

CASE 14. No clauses remain to be channeled (a box four or door two has not yet been encountered) - terminate updating and obtain new clauses.


THE  METHOD  EXTENDED 

In  the  present  paper,  we  have  not  considered  questions  beyond  the 
domain  of  truth-function  theory.  As  is  evident  from  our  earlier  general 
remarks,  there  is  little  doubt  many  such  questions  must  receive  effective 
treatment  if  the  computer  is  to  be  used  eventually  as  a  powerful  theorem 
tester  for  logical  and  mathematical  formulas.  Nevertheless,  it  is  reason- 
able to  suppose  that  for  ultimate  success  we  shall  require  powerful 
techniques  for  manipulating  and  testing  very  large  truth-functional  ex- 
pressions. Because  of  the  result  we  have  cited,  due  to  Herbrand,  by 
which  a  quantificational  formula  is  reduced  to  an  expanding  truth-func- 
tional one,  the  particular  problem  we  have  attacked  here  seems  central, 
not  only  as  an  aid  to  future  problem-solving,  but  also  as  a  device  which 
should  help  us  in  the  search  for  greater  problem-solving  acumen.2 

The  method  proposed  represents  the  lineal  descendant  of  our  earlier 
Paris  paper.  What  we  have  tried  is  to  take  advantage  of  the  normality  of 
the  expression  so  as  to  handle  not  only  very  much  larger  cases  but  also 
cases  which  are  not  static,  but  expanding.  In  this  connection,  then,  have 
we  gone  as  far  as  we  can  go?  Well,  certainly,  as  suggested  earlier,  there 
are  a  number  of  open  questions  which  we  can  perhaps  best  see  how  to 
handle  by  trial  and  error  in  using  the  method.  Beyond  such  questions, 
however,  is  there  something  more  we  might  do?  Probably,  yes.  The  ex- 
pressions which  result  from  a  Herbrand  expansion,  censored  or  not,  will 
most  likely  prove  quite  loose  in  the  sense  that,  for  valid  formulas,  a  very 
small  subset  of  the  total  clauses  will  be  sufficient  for  the  over-all 
validity  of  the  formula.  What  we  need  to  do,  then,  is  to  take  proper  ad- 
vantage of  the  great  looseness  of  the  expanding  alternational  formulas 
we  shall  probably  encounter. 

To  accomplish  this  aim,  we  have  in  mind  two  basic  approaches.  One 
of  these  is  a  nuts-and-bolts  extension  to  the  method  as  already  explained, 
the  rough  details  of  which  have  been  worked  out,  and  which  will  be  pro- 
grammed in  the  immediate  future.  The  other  is  a  more  speculative  tech- 
nique based  upon  the  extended  method  proposed  here,  but,  perhaps,  very 
much  more  powerful.  We  cannot  claim  to  have  solved  the  basic  problem 
necessary  to  make  this  speculative  technique  work,  but  it  does  not  ap- 
pear to  us  to  be  insurmountable. 


2 For other papers pertinent to the general problem see [6] through [12].

Let us consider first the nuts-and-bolts extension. As we enter upon a
new  problem,  we  probe  downward  through  essentially  unknown  territory 
until  a  patent  validity  or  nonvalidity  causes  us  to  ascend  or  seek  new 
clauses,  as  the  case  may  be.  Because  we  have  gone  directly  "to  the 
bottom,"  so  to  speak,  and,  after  each  ascent,  probe  downward  again, 
nonvalidities  have  an  excellent  opportunity  to  appear  early.  Suppose, 
however,  we  are  working  on  a  valid  expression.  Our  various  excursions 
downward  reveal  more  and  more  about  the  structure  of  the  problem.  The 
presence  of  certain  clauses  at  certain  levels  is  enough  to  ensure  validity. 
Consider,  for  example,  the  following  expression: 

[12  3  4  5  6    ] 

a  V   a  b  c  V   a  b  c   V   ace   V   ad  e   V   ad  e 

If  the  clauses  are  numbered  in  the  manner  indicated  and  the  variables  are 
eliminated  alphabetically,  the  following  diagram  shows  the  history  of 
our  initial  descent: 


[1 2 3 4 5 6]
  |-- (①)  door one
  `-- [2 3 4 5 6]
        |-- (3 4 5 6)  box one
        `-- [2 4 5 6]
              |-- (② 5 6)  door one
              `-- [4 5 6]
                    |-- (4 5)  box one
                    `-- [4 6]  valid


Normal  parentheses  enclose  the  entries  held.  Clearly,  the  presence  of  4 
and  6  at  level  e  guarantees  validity.  As  we  continue  working  the  prob- 
lem, we  obtain  the  following  register,  showing  sufficient  conditions  for 
validity  at  the  various  levels. 


   a    (1, 2, 3, 4, 5, 6)
   b    (2, 3, 4, 5, 6)
   c    (2, 4, 5, 6)   (3, 4, 5, 6)
   d    (4, 5, 6)
   e    (4, 5)   (4, 6)


The  register  would  be  formed,  of  course,  during  ascent.  Were  the  problem 
just  considered  part  of  a  very  much  larger  complex,  we  might,  by  proper 
use of the register, avoid exploring paths which contain known
ingredients of validity. Although the register might become quite large, we
would  need  to  consider  only  the  entries  at  the  appropriate  level.  The 
comparison  of  traps  with  entries  should  not  go  slowly,  since  we  would 
exclude  an  entry  at  the  first  recalcitrant  clause.  Let  us  now  take  a  some- 
what looser  example,  still  eliminating  variables  in  alphabetical  order: 

   1        2       3       4       5

  a b   V   a b   V   c   V   c   V   c


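[Diagram: history of the descent. One held entry, (1 3 4 5), appears
under "box one"; a second held entry ending in 3 4 5 appears under
"door one"; the list [3 4 5] is marked valid.]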

On  reflection,  we  see  that  the  following  register  is  complete: 


(3-5, 4)


By  hyphenating  clauses  3  and  5,  we  indicate  that  either  is  adequate 
without  the  other.  Although,  on  first  encounter,  we  do  not  recognize  the 
patent  validity  inherent  in  clauses  3,  4,  and  5  until  the  variable  c  is 
eliminated,  the  use  of  the  register  will  prevent  future  tardiness. 

The  extent  to  which  we  may  profitably  employ  registers  (worked  out  in 
somewhat  more  elaborate  detail)  can  best  be  determined  by  experiment. 
It  may  be  they  will  prove  useful  only  at  specified  points  in  the  solution. 
Nevertheless,  as  we  gauge  the  matter  now,  a  rather  heavy  reliance  on 
them  is  called  for.  In  a  loose  problem,  the  clauses  leading  to  validity  at 
various  stages  should  prove  a  small  minority.  There  will  probably  be 
much repetition. The systematic method of solution with its "built-in"
bookkeeping  enables  us  to  retain,  without  great  effort,  a  running  history 
of  the  problem,  from  which  irrelevancies  are  largely  eliminated.  It  seems 
reasonable  that,  when  we  explore  regions  having  much  in  common  with 
regions  already  examined,  we  should  profit  from  our  earlier  experience. 
The  use  of  a  growing  register  seems  a  straightforward  way  of  accom- 
plishing this.  Indeed,  should  our  hopes  in  this  connection  prove  justified, 
our  speed  of  solution  will  increase  as  we  move  from  the  start  of  the 
problem  to  the  end,  and  our  over-all  problem-solving  machinery  will 
possess  a  quite  realistic  self-improvement  mechanism. 
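
The register lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' program: the names `Register`, `record`, and `known_valid` are hypothetical, a "level" is identified by the letter of the variable being eliminated, and an entry is taken to be a set of clause numbers whose joint presence at that level is already known to ensure validity. The entries used are the ones for levels d and e in the first example.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical register: for each level, the clause-number sets already
# known to be sufficient for validity at that level.
Register = Dict[str, List[FrozenSet[int]]]


def record(register: Register, level: str, clauses: Iterable[int]) -> None:
    """Store a sufficient condition for validity found during ascent."""
    entry = frozenset(clauses)
    entries = register.setdefault(level, [])
    if entry not in entries:
        entries.append(entry)


def known_valid(register: Register, level: str, held: Iterable[int]) -> bool:
    """Return True if the clauses held at this level contain a recorded entry."""
    held_set = set(held)
    for entry in register.get(level, []):
        # all() stops at the first clause of the entry that is missing,
        # so an entry is excluded at the first recalcitrant clause.
        if all(clause in held_set for clause in entry):
            return True
    return False


register: Register = {}
record(register, "d", [4, 5, 6])
record(register, "e", [4, 5])
record(register, "e", [4, 6])

print(known_valid(register, "e", [4, 6, 9]))  # True: contains the entry (4, 6)
print(known_valid(register, "e", [5, 6]))     # False: no entry is contained
```

Only the entries filed under the current level are consulted, which is why the size of the whole register need not slow the comparison.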
So much for the nuts-and-bolts extension. What of the more speculative
procedure  mentioned  earlier?  Suppose  it  generally  happens  that  the 
transition  from  the  nonvalid  expanding  expression  to  its  valid  resolution 
is  abrupt  and  large,  rather  than  gradual  and  small.  That  is  to  say,  until 
certain  clauses  are  added  in  the  final  appendage,  the  expression  is  not 
"near"  validity.  Were  the  truth  table  produced,  it  would  contain  many 
zeros.  For  extremely  large  examples,  the  method  we  have  set  forth  should 
be  able  to  unmask  in  a  relatively  short  time  nonvalid  formulas  "far" 
from  validity.  Suppose  we  proceed,  then,  adding  new  chunks  of  clauses 
and  unmasking,  until  suddenly  a  transition  occurs.  The  expression  is  no 
longer  "far"  from  validity,  as  we  determine  by  a  series  of  probes.  We 
might,  for  example,  skip  high  up  into  the  record  and  descend  by  a  new 
path  to  ensure  that  our  inability  to  uncover  a  non-validity  is  not  due  to  a 
merely  local  phenomenon.  We  now  suspect  that  the  expression  at  hand 
is  valid  —  although,  if  so,  it  will  be  much  too  large  to  be  checked 
throughout.  We  believe  it  to  be  very  loose,  however.  Our  problem,  then, 
earlier  alluded  to  as  unsolved,  is  to  reduce  the  size  of  the  expression 
without  losing  the  ingredients  necessary  to  validity.  It  does  not  seem 
impossible  that  this  can  be  accomplished,  especially  if  we  can  determine, 
as  deletions  are  made,  whether  the  expression  has  lost  its  "nearness" 
to  validity.  Once  the  expression  has  been  reduced  to  a  manageable  size, 
we  can  test  it  outright.  Here  we  are  obviously  within  the  realm  of  hopeful 
speculation.  Still,  if  the  problems  to  be  dealt  with  are  difficult,  they  are 
also  interesting. 

We  wish  to  thank  Professor  Burton  Dreben  of  Harvard,  who  has  pro- 
vided many  helpful  suggestions.  In  particular,  Professor  Dreben  empha- 
sized to  us  the  fact  that  the  expressions  to  be  tested  are  both  expanding 
and  loose. 
Bayesian Statistics*

Leonard J. Savage


INTRODUCTION

It has already been made plain during this symposium that statistics
does have an important connection in principle with the symposium
topics: decision and information.

Statistics today is especially connected with decision, because,
beginning a decade or two ago, a vigorous school of statisticians
commenced exploring the thesis that statistical inference was concerned
with decision in the face of uncertainty; see for instance [122, 124, 129,
188]. Statistics thus came to be conceived by them more and more as an
economic science. Whether this concept of statistics is right is open to
question. Some people would say that it is too inclusive, that there is
much connected with behavior in the face of uncertainty that is not
statistics (Bartlett in [39]). Another vigorous school of critics finds it
much too narrow. According to them, decision theory is all right for the
Kremlin and the wicked people in Wall Street but not for honest men in
the pursuit of science [7, 11, 64, 65]. For me, however, decision theory
is the best and most stimulating, if not the only, systematic model of
statistics. Most important, it has had great interest in itself and it
undoubtedly has some relevance to statistics, however statistics is
defined. The decision-theoretic view of statistics certainly has relevance
to us; for we came here largely to talk about decision.

"Information" is a loose term, referring sometimes to studies of what
this morning was called entropy by Chung¹ and to a host of other ideas
as well. In connection with statistics, the term particularly brings to
mind the problems of communicating, condensing, and handling data.
Although information is also vital to the subject, today I shall be
emphasizing the decision-theoretic aspects of statistics.

* The preparation of this lecture was supported in part by The Institute of
Science and Technology at the University of Michigan. The written version owes
much to aid from several friends, especially to the critical reading of Leslie
Kish and to the good taste and alert eye of Louise Forsyth.

¹ Kai Lai Chung, "The Ergodic Theorem of Information Theory," this volume,
pp. 141-148.

My  title  is  "Bayesian  Statistics,"  a  rather  inadequate  name  for  a  new 
movement  in  statistical  theory  for  which  a  better  name  has  not  yet  been 
coined.  Those  of  us  who  adhere  to  this  movement  feel  that  it  represents 
a  great  step  forward  and  the  breaking  of  a  log  jam.  Statistical  theory 
seems  to  us  to  have  gone  about  as  far  as  it  could  go  under  the  particular 
constraints  that  it  has  set  itself;  a  time  for  re-examination  and  unshack- 
ling of  those  constraints  has  come. 

Hearing  that  there  is  a  new  movement  in  the  foundations,  or  theory,  of 
statistics,  people  sometimes  leap  to  the  conclusion  that  such  a  movement 
must  concern  only  specialists  and  have  no  immediate  practical  implica- 
tions. After  all,  it  would  seem  reasonable  to  suppose  that  a  generation 
of  bright  people  working  in  the  subject  would  not  overlook  major  oppor- 
tunities to  exploit  common  sense;  however,  the  unreasonable  seems  to 
have  happened.  One  of  the  most  surprised  in  this  respect  was  I,  who 
wrote  a  book  [154]  on  the  foundations  of  statistics  a  few  years  ago  in  the 
full  conviction  that  one  had  only  to  take  a  critical  look  at  what  was  going 
on  to  explain  why  all  was  well.  In  the  preface  I  promised  to  do  just  that, 
saying  that  the  book  would  be  a  failure  if  it  did  not  succeed  in  this 
object.  As  I  wrote,  I  became  increasingly  deaf  to  my  own  leitmotif;  the 
book  was  written  and  published  before  I  realized  that  my  promise  was 
unfulfilled  and  unfulfillable. 

I  have  told  you  that  the  Bayesian  movement  is  an  important  step 
forward,  and  it  is  also  in  a  way  an  important  step  backward,  for  it  returns 
to  a  path  that  was  long  neglected  and  maligned.  The  name  Bayesian 
enters  for  the  superficial  reason  that  we  make  much  use  of  Bayes' 
theorem.  This  theorem  (more  or  less  traceable  to  Thomas  Bayes  [14,  15]) 
is  a  true  and  trivial  mathematical  fact,  or  constellation  of  facts.  But  the 
theorem  had  found  little  use  in  modern  statistical  theory  for  reasons  that 
will  soon  be  plain.  In  self-evident  notation,  Bayes'  theorem  says  essen- 
tially this. 

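$$\mathrm{Prob}(\text{Hypothesis} \mid \text{Datum}) = \frac{\mathrm{Prob}(\text{Datum} \mid \text{Hypothesis})\,\mathrm{Prob}(\text{Hypothesis})}{\mathrm{Prob}(\text{Datum})}, \tag{1}$$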

when  division  by  0  is  not  entailed.  As  a  theorem,  this  is  so  trivial  that 
it  would  be  hard  to  have  the  concept  of  conditional  probability  at  all 
without  seeing  the  truth  of  Equation  1. 
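
As a small numerical illustration of Equation 1, the following sketch computes Prob (Hypothesis | Datum) from its three ingredients. The function names and the numbers are illustrative assumptions and do not come from the lecture; Prob (Datum) is obtained here by summing over the hypothesis and its negation.

```python
def prob_datum(prior: float, p_d_given_h: float, p_d_given_not_h: float) -> float:
    """Total probability of the datum over the hypothesis and its negation."""
    return p_d_given_h * prior + p_d_given_not_h * (1.0 - prior)


def posterior(prior: float, p_d_given_h: float, p_datum: float) -> float:
    """Equation 1: Prob(H | D) = Prob(D | H) * Prob(H) / Prob(D)."""
    if p_datum == 0:
        # The theorem is stated only when division by 0 is not entailed.
        raise ValueError("Prob(Datum) is 0; Bayes' theorem does not apply")
    return p_d_given_h * prior / p_datum


# Illustrative numbers only.
prior = 0.2          # Prob(Hypothesis)
p_d_h = 0.9          # Prob(Datum | Hypothesis)
p_d_not_h = 0.3      # Prob(Datum | not Hypothesis)

p_d = prob_datum(prior, p_d_h, p_d_not_h)   # 0.18 + 0.24
print(posterior(prior, p_d_h, p_d))         # 0.18 / 0.42, about 0.43
```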
The frequent use of Bayes' theorem is but a symptom of the present
new movement. The deeper characterization is its exploitation and
systematic use of a concept of probability that was for many unheard of
and for almost all others taboo, until recent years. This is what can be
called "personal probability," or "subjective probability"; whatever you
call it, somebody has a valid objection against the name. I shall say
"personal probability" here and shall of course define it in due course.

It  is  necessary  to  be  rather  dogmatic  today;  to  finish  in  an  hour  or  so, 
I  simply  have  to  say  what  seems  to  me  the  truth  and  not  always  be 
voicing  the  qualification  that  this  is  just  my  opinion.  Once,  then,  for  all, 
this  is  just  my  opinion  or  at  best  that  of  a  small  group  of  people.  We 
think  that  we  are  right  and  reasonable  and  that  we  can  so  demonstrate, 
but  I  can  not  pretend  to  be  demonstrating  very  much  today.  I  can  only 
give  you  a  glimpse  at  what  the  shouting  is  about. 

I  am  under  another  disability,  a  most  heterogeneous  audience.  There 
are  among  you  engineers,  economists,  statisticians,  and  mathematicians. 
There  are  people  here  who  know  a  good  deal  of  statistics  and  people 
who  know  literally  none.  I  shall  do  my  best,  and  you  must  please  co- 
operate, putting  up  with  being  bored  at  parts  you  know  very  well  and  also 
with  not  understanding  a  sentence  here  and  there.  Nobody  need  fall  off 
altogether;  there  will  not  be  that  kind  of  tight  organization.  I  may,  though, 
sometimes  make  an  unexplained  allusion  for  the  benefit  of  just  those 
who  will  understand  it. 

My talk will be relatively philosophical or abstract; that of Raiffa² on
the  same  general  subject  is  more  concrete  and  more  concerned  with 
specific,  formal  problems.  We  both  hope  the  two  talks  will  complement 
each  other  and  that  this  talk  will  serve  to  some  extent  as  introduction  to 
his.  My  main  themes  are:  the  meaning  and  nature  of  personal  probability; 
other  views  of  probability;  an  illustrative  topic  in  statistics  called  simple 
dichotomy,  which  is  a  sort  of  small-scale  model  of  statistics;  a  topic 
that  I  call  "stable  estimation,"  one  of  the  most  satisfying  and  useful 
chapters  of  the  new  statistics;  and  various  topics  and  examples  to  bring 
out  clearly  that  there  is  something  new  and  practical  here,  that  Bayesian 
statistics  is  not  merely  a  matter  of  calling  old  things  by  new  names. 


PERSONAL  PROBABILITY 

Personal probability is a certain kind of numerical measure of the
opinions of somebody about something. It will be the right approach for
each of you to think that you are the person under discussion; for this
reason Good [72] introduced "you" as the technical term for this person.
Let us, to begin with an example, consider what you think about the
weight of that chair.³ There is no trickery; it is the kind of chair it
seems to be, and it weighs something. Suppose (overlooking some points
of law) that I were to write a contract on a slip of paper promising the
bearer $10.00 if this chair weighs more than 20 lb and to offer the
contract up for auction. What would you bid? If you would bid as much as
$5.00 I would say, roughly speaking, that you regard it as at least an
even money bet that the chair weighs more than 20 lb. If you would pay
just $9.00, I would take this as meaning, by definition, that the
probability of this event is exactly 9/10 for you. Thus, the personal
probability of an event for you is the price you would pay in return for a
unit payment to you in case the event actually obtains; in other words, a
probability is the price you would pay for a particular contingency.

² Howard Raiffa, "Bayesian Decision Theory," this volume, pp. 92-101.

There is a little flaw in this definition of personal probability: it
collides with the facts of life about large sums of money, that is, with
utility  phenomena.  There  are  definitions  that  escape  this  flaw  [154],  but 
the  one  just  given  is  the  most  colorful  and  easy  definition  to  comprehend 
when  one  is  first  introduced  to  the  subject,  and  it  is  easy  to  employ. 
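
The price definition can be written out as a one-line computation. The sketch below is an illustration only; the function name is hypothetical, and it sets aside the utility flaw just mentioned by treating money as linear over the small sums involved.

```python
def personal_probability(price: float, payoff: float) -> float:
    """Price paid per unit of payoff for a contract on some contingency."""
    if payoff <= 0 or not 0 <= price <= payoff:
        raise ValueError("expected payoff > 0 and 0 <= price <= payoff")
    return price / payoff


# The chair contract pays $10.00 if the chair weighs more than 20 lb.
print(personal_probability(9.00, 10.00))  # 0.9: paying just $9.00
print(personal_probability(5.00, 10.00))  # 0.5: an even-money bet
```

A bid of as much as $5.00 shows only that the probability is at least 0.5 for the bidder; it is the highest price the bidder would pay that fixes the probability exactly.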

Of course, personal probabilities can be, and often are, expressed in
terms  of  odds.  But  I  thought  it  advantageous  to  put  it  in  terms  of  a  price. 
For  some  critics  feel  that  personal  probabilities  are  nonsense,  mysteri- 
ous, and  unreal,  and  I  hope  thus  to  bring  out  that  they  are  exactly  as 
nonsensical,  mysterious,  and  unreal  as  the  price  that  you  would  now  pay 
for  a  steak  or  a  trip  to  Chicago.  They  are,  in  fact,  prices  for  a  kind  of 
thing  that  we  do  buy  and  sell  every  day,  both  overtly  in  the  market  and 
implicitly  in  our  behavior. 

There  are  some  things  that  personal  probability  is  not;  these  are 
noticed  by  critics  of  the  concept  and  are  often  brought  forward  in  criti- 
cism, although  they  are  simply  facts  and  are  not  really  detrimental  to  the 
theory.  One  thing  personal,  or  subjective,  probability  is  not;  it  is  not 
objective.  There  is  no  such  thing  as  the  right  price  for  you  to  pay  for  the 
contingency  that  this  chair  weighs  more  than  20  lb.  Each  of  you  has,  in 
fact,  his  own  opinion.  Inasmuch  as  you  can  not  communicate  with  each 
other  sitting  where  you  are,  you  have  nothing  better  to  go  on;  if  we  were 
really  playing  the  game  for  money,  each  of  you  would  be  on  your  own, 
dependent  entirely  on  your  personal  judgment. 

Another fact that is sometimes thought to be a serious objection is
that personal probabilities are not precise. Who can say exactly what he
would give or take for a particular contingency? To illustrate, there
presumably is some weight such that you would barely give $5.00 for $10.00
in case the chair should exceed that weight. If I were to propose 6 oz, of
course everyone here would be happy to give $5.00 for $10.00 contingent
on the weight exceeding 6 oz. As I go up from 6 oz to several lb and
come up to, let us say, 100 lb for the weight of this chair that I have
been holding in my left hand, there are no more takers. Presumably, then,
there is for you a weight at which it would be indifferent for you to offer
$5.00 for the contingency. Just ask yourself, though, and most of you
will notice that your median weight for this chair is hard for you to pin
down; you have a sense of vagueness about it. Some people see the
vagueness phenomenon as an objection; I see it as a truth, sometimes
unpleasant but not to be escaped by a new theory.

³ Ed. Note: Here Savage pointed to a plastic covered armchair on the platform.
He later picked it up and held it with one hand.

Another  feature  in  employing  personal  probability  is  that  it  imposes 
the  difficult  responsibility  to  be  honest  with  yourself.  It  is  difficult  to  be 
honest  with  one's  self  about  prices  generally.  I  have  a  car  to  get  rid  of, 
and  in  my  secret  heart,  I  think  that  if  somebody  offered  me  $300  for  it  I 
would  leap  with  joy.  Yet  I  can  imagine  suddenly  saying  to  myself,  "Gee! 
it  must  be  worth  at  least  $1,000.  I  just  wouldn't  take  $300  for  it."  It  is 
hard  to  think  back  realistically,  once  a  price  has  been  offered,  to  ask 
yourself  what  price  would  have  satisfied  you.  This  is  especially  true 
when  the  price  is  for  something  you  hadn't  actually  been  thinking  of 
buying  or  selling. 

And  so  it  is  with  probabilities.  You,  at  present,  may  think  some 
scientific  hypothesis  very  unlikely,  if  you  happen  even  to  think  of  it  at 
all.  It  seems,  for  example,  that  at  some  girls'  college,  success  was 
found  to  be  ostensibly  more  highly  correlated  with  the  personality  of  the 
student's  grandmother  than  with  anything  else  the  investigators  could 
find.  Before  the  data  brought  such  an  idea  to  your  attention,  you  would 
have  thought  it  pretty  preposterous,  if  asked.  But  after  the  data  begin  to 
point  to  it,  you  are  in  psychological  danger  of  using  the  data  twice. 
First,  since  the  data  suggest  the  grandmother  hypothesis,  they  do  give 
you  some  reason  to  believe  it,  but  you  are  likely  to  forget  that  you  really 
did not believe it one little bit before you saw the data. Second, inducing
you thus to forget, the data affect what were supposed to be your
preconceptions. It takes a lot of self-discipline not to exaggerate the
probabilities you would have attached to hypotheses before they were
suggested to you. This is a real danger in the application of Bayesian
statistics, one that requires constant vigilance, and probably special
devices, to guard against.

What  are  here  called  "probabilities"  are  not  generated  by  the  opinions 
of  just  anybody.  When  we  theorize  about  probabilities,  we  theorize  about 
a  coherent  person.  By  a  coherent  person  I  mean  pretty  much  a  person  who 
does  not  allow  book  to  be  made  against  him,  a  person  such  that  if  we 
offer him various contingencies we cannot, by some sleight of hand, sell
him  such  a  bill  of  goods  that  he  will  be  paying  us  money  no  matter  what 
happens.  Should  you  offer  5  dollars  for  this  contingency,  6  for  that,  3  for 
the  other,  and  so  on  in  such  a  concatenation  that  when  all  the  chips  are 
down  you  will  have  to  pay  me  something  net  no  matter  what  happens, 
then  you  have  made  a  mistake.  Of  course,  real  people  do  make  mistakes. 
We make mistakes all the time; we cannot hope to avoid them entirely.
We  can,  however,  deplore  mistakes  and  try  to  hold  them  to  a  minimum. 
So  we  look  into  the  conditions  of  coherency  for  a  system  of  probabilities. 
By  a  system  of  probabilities  I  mean  what  you  would  offer  for  all  con- 
tingencies under  discussion  in  a  given  context. 

Easy  theories  [42,  91,  154]  show  that  if  you  are  consistent,  your 
probability  function,  the  probability  that  you  associate  with  each  event, 
is a mathematical probability in the ordinary sense. That is, the
probability is non-negative; the probability that two and two make four is 1;
and the probability that one of two mutually incompatible events will
occur is the sum of the two probabilities that each will occur. (Those of
you who know about countable additivity please forget it; the topic is too
fancy for today [43, 50].)
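
One consequence of these rules can be checked mechanically: prices for unit payments on a set of mutually exclusive and exhaustive events must add up to 1, or book can be made. The sketch below is an illustration under assumptions that go beyond the text: the function name and the three weight classes are hypothetical, and the person is assumed willing to either buy or sell each $1 contract at the stated price. It tests only this one necessary condition for coherence, not coherence in general.

```python
from typing import Dict


def sure_loss(prices: Dict[str, float], tol: float = 1e-12) -> float:
    """Guaranteed net loss on $1 contracts over a partition of events.

    Returns 0.0 when the prices add up to 1 (within tol), so that no
    book can be made in this particular way.
    """
    total = sum(prices.values())
    # Exactly one event occurs, so the full set of contracts pays exactly $1.
    # If total > 1, sell the person every contract: they pay total, get $1.
    # If total < 1, buy every contract from them: they get total, pay $1.
    gap = abs(total - 1.0)
    return gap if gap > tol else 0.0


incoherent = {"10 lb or less": 0.5, "over 10, up to 20 lb": 0.4, "over 20 lb": 0.3}
coherent = {"10 lb or less": 0.4, "over 10, up to 20 lb": 0.4, "over 20 lb": 0.2}

print(sure_loss(incoherent))  # about 0.2, lost whatever the chair weighs
print(sure_loss(coherent))    # 0.0
```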

There  is  another  important  mathematical  feature  of  a  consistent  system: 
Conditional  probabilities  work  as  they  should.  A  conditional  probability 
is  what  you  would  pay  for  a  contingency  that  is  itself  contingent  on 
something  else.  Consider  an  example.  The  Smiths  have  just  had  twins. 
What  would  you  pay  for  $1.00  contingent  on  the  possibility  that  they  are 
identical  (or  look-alike)  twins?  Perhaps  about  30  cents,  especially  if  you 
know  something  of  the  biological  experience  in  this  area.  Then  0.3  is  the 
approximate  probability  for  you  that  the  twins  are  identical.  Suppose, 
however,  that  the  dollar  is  to  be  paid  to  you  contingent  not  only  on  the 
twins'  being  identical  but  on  their  also  both  being  boys,  and  suppose 
that  what  you  offer  to  pay  in  compensation  is  itself  contingent  on  their 
both  being  boys.  What  would  you  offer  then?  It  seems  reasonable,  and  it 
can  be  shown,  that  your  contingent  offer  should  be  what  you  would  offer 
outright  if  you  already  knew  that  the  twins  were  both  boys.  Since  this 
would  be  some  evidence  in  favor  of  the  twins'  being  identical,  you  should 
pay  more  than  the  30  cents  you  were  willing  to  pay  without  such  evi- 
dence; a  good  case,  in  fact,  can  be  made  for  about  50  cents.  In  short, 
the  (conditional)  probability  that  the  twins  are  identical  given  that  they 
are both boys is about 1/2 for me and perhaps for you. The word
"conditional" is often dropped, especially because the distinction between
conditional  and  unconditional  probabilities  is  only  relative;  put  differ- 
ently, probabilities  are  always  conditional  on  all  that  is  actually  known 
by  the  judging  person.  Writing  the  probability  of  the  event  A  given  the 
event B as Prob (A | B), we are led by considerations of coherency to the
familiar chain rule for the probability that A and B both obtain,

$$\mathrm{Prob}(A \cap B) = \mathrm{Prob}(A \mid B)\,\mathrm{Prob}(B) = \mathrm{Prob}(B \mid A)\,\mathrm{Prob}(A), \tag{2}$$

which is barely, if at all, distinguishable from Bayes' theorem. For a
fixed B, Prob (A | B) ought to be the probability you would associate
with events A if your knowledge were augmented by news of B and by
nothing more. This function of A should therefore itself form a coherent
probability system, as Prob (A) does. The anticipation is easily verified
by means of the chain rule, provided Prob (B) is positive.
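
The figure of about 50 cents in the twins example can be recovered from the chain rule under two assumptions that the lecture does not state and that are supplied here only for illustration: identical twins are always of the same sex, so they are both boys with probability 1/2, and fraternal twins are both boys with probability 1/4. Writing I for "identical" and BB for "both boys," and taking Prob (I) = 0.3 as in the text:

```latex
\begin{align*}
\mathrm{Prob}(I \cap BB) &= \mathrm{Prob}(BB \mid I)\,\mathrm{Prob}(I)
    = \tfrac{1}{2}(0.3) = 0.15, \\
\mathrm{Prob}(BB) &= \tfrac{1}{2}(0.3) + \tfrac{1}{4}(0.7) = 0.325, \\
\mathrm{Prob}(I \mid BB) &= \frac{\mathrm{Prob}(I \cap BB)}{\mathrm{Prob}(BB)}
    = \frac{0.15}{0.325} \approx 0.46 .
\end{align*}
```

On these assumptions the conditional probability is a little under 1/2, in line with the "about 50 cents" of the text.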

You  now  have  the  bare  bones  of  the  theory  of  personal  probability.  It 
pertains  to  events  and  to  an  idealized,  or  coherent,  person.  Nothing  in 
the  theory  says  that  all  coherent  people  are  the  same,  but  it  does  say 
that  there  are  certain  rules,  breach  of  which  is  incoherence.  And  pre- 
sumably you  don't  want  to  be  incoherent  in  that  way.  I  don't. 

The  principal  early  pioneers  of  personal  probability  are  Ramsey  [140], 
de  Finetti  [42,  45,  48],  Koopman  [96,  97,  98],  and  Good  [72,  73,  74]. 
Borel  [26],  Fry  [68],  Molina  [117,  118,  119],  and  I  [154]  are  among  those 
who  have  also  participated  in  various  ways.  Schlaifer  [163]  recently  has 
led  the  way  in  actual  applications  based  on  a  clear  theoretical  under- 
standing. An  interesting  recent  publication  is  that  of  Smith  [171]. 


OTHER  VIEWS 

There  are  other  alleged  probability  concepts  on  the  market.  Candidly, 
I  do  not  believe  that  they  are  clearly  defined,  or  definable.  But  this  talk 
need  not  support  so  radical  a  thesis.  It  is  enough  for  the  present  if  you 
see  that  the  personalistic  concept  is  meaningful  and  merits  exploration. 

The  overwhelmingly  popular  idea  of  probability  among  statisticians  to 
date  has  been  the  frequency  concept.  A  frequency  probability  is  sup- 
posed to  be  the  relative  frequency  of  some  kind  of  event  such  as  heads 
with  a  not  necessarily  fair  coin  or  twins  among  births.  It  is  supposed  to 
measure  an  objective  fact  of  nature.  Reflection  suggests  to  me  that  the 
concept will not bear examination. Considering its great popularity, there
has been surprisingly little effort to support it critically; Refs. [65, 144,
178, 183], which can be cited, by no means pull together. A careful
personalistic analysis of those situations in which frequency probability is
ordinarily invoked was given decades ago by de Finetti [42] but has
received much too little attention; see Section 3.7 of Ref. [154] for an
elementary exposition of this topic, and Refs. [81] and [149] for recent
mathematical advances in it.

Another  probability  concept  has  been  of  historical  importance  and  still 
fascinates  some  scholars.  This  I  call  the  necessary  (or  logical,  or  sym- 
metry) concept  of  probability.  According  to  this  concept,  probability  is 
much  like  personal  probability,  except  that  here  it  is  argued  or  postu- 
lated that  there  is  one  and  only  one  opinion  justified  by  any  body  of 
evidence,  so  that  probability  is  an  objective  logical  relationship  between 
an event A and the evidence B. The necessary concept has been
unpopular with modern statisticians, justifiably, I think. The leading modern
philosophical exponents of the necessary view, which has roots in
antiquity [14, 15], have been Keynes [93] and Carnap [31, 32]; they both
tend to renounce, or at least to temper, the necessary view in the latter
parts of their works. Harold Jeffreys [86, 87] is best described as a
necessarist,  and  has  done  far  more  toward  applied  statistics  than  any 
other  of  that  persuasion.  His  works  are  a  source  of  invaluable  material 
for  us  personalistic  Bayesians;  we  have  some  changes  to  make,  but  they 
are  few  and  easy. 

What  I  believe  is  that  the  personalistic  concept  goes  to  the  heart  of 
all  applications  of  probability.  This  has  been  ably  demonstrated  by 
de  Finetti  [42,  45,  48].  In  my  opinion,  he  is  correct  in  his  thesis,  which 
he  sustains  well;  and  it  is  now  unjustified  to  repeat  over  and  over  again 
that  the  personal  view  must  be  wrong,  without  reading  something  of  what 
de  Finetti  and  others  have  said  about  it. 

The  other  views,  the  symmetry  view  and  the  frequency  view,  have 
analyses  in  personalistic  terms  that  make  plain  what  they  are  getting  at 
and  what  value  they  do  have.  That,  however,  is  not  the  burden  of  this 
afternoon's  talk.  If  it  makes  you  terribly  uncomfortable  to  think  that 
someone  is  saying  that  frequency  probability  is  nonsense,  though  you 
can  almost  touch  it  with  your  fingers,  then  hold  to  your  conviction  for 
the  present.  That  will  not  interfere  in  the  least  with  what  I  want  to  say; 
similarly  for  those  of  you  who  feel  that  there  must  be  some  sense  in  the 
symmetry  view.  There  is  some  sense  in  it,  which  can  ultimately  best  be 
seen  through  the  personalistic  position. 

Statistics, as you know, has blossomed forth tremendously in the
twentieth century, beginning, I suppose, in the early 20's. One great
stimulus to this growth is the economic one, the recognition that it pays
to think about behavior in the face of uncertainty. Certain kinds of
experiments, such as expensive and time-consuming ones, seem particularly
to evoke systematic and more or less mathematical thought about this
problem. Hand in hand with the growth of statistics has gone a vigorous,
emotional  adoption  of  frequency  probability  as  the  only  valid  kind.  The 
reason  for  this,  I  conjecture,  is  that  the  personal  concept  has  been  very 
little known. Historically, it is hard to find a champion of, a book about,
or even a paper about, the theory of personal probability until rather
recently. In 1924 there was an obscurely published fragment by Borel
[26]; in the late 1920's Ramsey wrote and buried two short notes about
it, which were published in [140] in 1931; and the thorough paper of de
Finetti [42] published in 1937 is still too little known. Thus,
statisticians really had no one to extol personal probability to them when the
present  rapid  growth  of  statistics  was  beginning.  They  knew  that  the 
symmetry  kind  was  inoperable  as  far  as  they  were  concerned,  so  they 
committed  themselves  totally  to  frequency  probability,  which  must  have 
seemed  to  them  the  sole  alternative. 

For  this  commitment  they  paid  a  high  price.  For  whatever  frequency 
probability  means  or  tries  to  mean,  it  does  not  refer  to  ordinary  isolated 
events. Almost everyone is agreed (von Mises may be an exception
[183, 184]) that it is impossible for frequentists to talk about the
probability that this chair weighs more than 20 lb or the probability that
your  airplane  will  be  late  tonight.  That  meant  doing  without  —  especially 
doing  without  Bayes'  theorem.  To  illustrate,  you  might  perform  an  experi- 
ment to  see  whether  aspirin  makes  rabbits'  ears  curl,  trying  twenty  rab- 
bits with  aspirin  and  twenty  without  it.  You  can  put  them  through  the 
curlometer  at  standard  temperature  and  pressure,  but  when  all  is  said 
and  done,  the  question  of  whether  aspirin  is  efficacious  in  this  way  is 
not  supposed  to  be  one  that  admits  of  a  probability.  There  was  no  prior 
probability,  nothing  to  feed  into  Bayes'  theorem.  To  know  the  probability 
that  aspirin  makes  a  rabbit's  ears  curl  so  many  degrees,  you  should  feed 
into  Bayes'  theorem  how  probable  you  thought  that  was  before  you  began 
the  work  and  how  probable  these  data  would  be  if  the  drug  were  that 
efficacious.  The  first  of  these  factors  is  commonly  very  subjective  and 
personal,  in  my  view.  In  the  frequency  view,  it  just  isn't  there,  so  there 
are  no  inputs  and  therefore  no  outputs.  The  problem  for  the  frequentists 
was  to  make  inferences  without  really  making  inferences,  to  get  outputs 
without  inputs.  In  the  words  of  de  Finetti  [50,  p.  19],  "People  noticing 
difficulties  in  applying  Bayes'  theorem  (in  particular,  noticing  that  the 
symmetry  view  was  in  difficulty)  were  like  one  who  says,  'Since  it  is  not 
secure  to  build  on  sand,  it  will  eliminate  all  danger  to  take  away  the 
sand  and  build  on  the  void.'..." 

Half  a  century  of  building  on  the  void  has  produced  some  remarkable 
results.  There  have  been  ingenious  attempts  to  get  off  the  ground,  some 
of  which  are  unquestionably  of  lasting  value.  In  fact,  one  can  discern 
which  ones  are  and  which  ones  are  not,  as  will  be  illustrated  later.  Many 
of  them  are  as  spurious  as  any  bootstrap  machine. 

There  is  not  just  one  dominant  school  of  statistics.  All  but  a  few  of  us 
call themselves frequentists of one school or another. There is one large
school that tends to be called the Neyman-Pearson School. I could not
say  to  just  what  extent  Jerzy  Neyman  or  Egon  Pearson  adheres  to  it; 
doubtless  the  present  views  of  these  two  leaders  are  somewhat  different 
from  each  other.  The  other  important  school  is  occupied  almost  uniquely 
by  R.  A.  Fisher;  if  numerically  small,  it  has  always  been  challenging  and 
stimulating. 

In  very  recent  years,  personal  probability  has  been  brought  to  bear  in 
statistics  [163,  164,  139].  One  way  of  looking  at  this  application  is  to 
see  it  as  adding  to  the  old  structure.  Bayes'  theorem  -  the  tempting 
formula  that  had  been  denied  inputs,  and  therefore  outputs,  for  decades  - 
now  has  inputs.  Thus  Bayesian  statistics  is  a  liberating  theory;  much 
that  couldn't  be  done  before  can  be  done  now. 

It  is  also  a  restraining  theory.  When  we  had  cut  ourselves  off  from 
Bayes'  theorem,  we  tried  other  things  that  sounded  plausible.  No  matter 
how  implausibly  they  turned  out  in  examples,  we  would  say,  "Well,  it  is 
an  unfortunate  example,  let  us  look  at  another  one."  Not  every  example 
was  unfortunate,  so  we  found  some  contentment.  Bayesian  theory  sys- 
tematically brings  up  criticism  of  some  of  these  methods,  and  we  are 
now  going  to  have  to  discard  them,  no  matter  how  pretty  they  are,  if  we 
accept  Bayesian  ideas.  Many  of  the  errors  that  the  Bayesian  theory  cor- 
rects had  been  discovered  before.  But  the  new  theory  sees  them  all  at 
once,  sees  why  they  occurred,  understands  them,  and  often  is  able  to  put 
something  useful  in  their  place.  Frequentists  who  have  recognized  these 
same  errors  in  the  past  have  seldom  been  able  to  correct  them  funda- 
mentally. 

SIMPLE  DICHOTOMY 

Let  me  tell  you  now  about  a  sort  of  small-scale  model  of  the  whole  of 
statistics.  This  is  a  famous  model  that  theoretical  statisticians  and 
teachers  of  statistics  are  always  turning  to  in  order  to  get  theoretical 
ideas  across  with  a  minimum  of  notation  and  complication.  The  model 
can  be  criticized  as  artificial;  I  shall  not  defend  it  or  even  try  to  find  a 
vivid example but shall just talk about it somewhat abstractly. It is not
too  abstract  to  comprehend  easily,  and  this  simple  model  generalizes  far 
and  well.  The  lessons  that  I  shall  underline  in  connection  with  it  are  not 
related  to  its  peculiar  use  of  the  number  two  or  anything  of  that  sort; 
they  really  are  quite  general  for  the  whole  of  statistics.  What  I  shall  do 
is  to  give  this  segment  of  the  talk  as  I  would  have  given  it  six  or  eight 
years  ago,  when  I  was  a  frequentist  (or  as  any  frequentist  might  give  it 
today),  up  to  the  point  where  something  new  can  be  done.  In  that  way  it 
will  become  clear  to  you  in  just  what  way  Bayesian  statistics  is  a 
step forward.

The set-up is this. There is some event, some scientific hypothesis,
such  as  the  dominance  of  some  gene,  that  either  obtains  or  does  not. 
Either  the  event  A  is  true  or  the  disjoint  event  B  is  true.  You  are  sure  of 
that,  or  if  you  are  not  sure  of  it,  you  are  at  least  willing  to  confine  dis- 
cussion to  it.  In  other  words,  while  you  may  admit  other  possibilities, 
you  may  be  interested  in  saying,  "Suppose  A  or  B  is  true,  then  what?" 
or  "Contingent  on  the  truth  of  A  or  B,  then  what?"  So  imagine  from  now 
on  that  either  A  or  B  must  be  true. 

You  are  in  a  position  to  gather  pertinent  evidence  as  to  which  is  true, 
to  do  some  experiment,  to  see  some  datum.  In  the  light  of  the  datum,  you 
are  required  to  make  a  decision,  and  I  here  artificially  assume  that  your 
range  of  decision  is  extremely  limited.  You  can  do  only  one  of  two 
things,  such  as  turn  right  or  turn  left,  or  buy  or  not  buy.  If  you  knew  that 
the  event  A  obtained  you  would  know  which  of  those  two  things  to  do; 
you  have  a  clear  preference  to  do  one  of  them  if  A  obtains  and  the  other 
if  B  does. 

If you adopt a decision rule (if you say, for example, "I shall examine
13 rabbits taken at random and do this if I see such-and-such results and
do that otherwise"), then you will be able to compute the probability of
each action given each hypothesis. Quite typically you will be in accord
with others about this; for the probability of the datum given the
hypothesis is often relatively public. That is to say, though, like all
probabilities, these are personal probabilities, many informed people
share them.

Call the probability of doing the wrong thing if A obtains $\alpha$, and the
probability of doing the wrong thing if B obtains $\beta$; these are known
as the errors of the first and second kind. The effect of a decision rule
is fully described by the errors of the first and second kind; everyone is
agreed upon that. Even if the experiment is prescribed, there will be
many different $\alpha$-$\beta$ pairs you might have for yourself, because you can,
given the experiment, adopt many rules as to whether you will behave as
though A obtained or as though B obtained. Each rule results in an $\alpha$-$\beta$
pair, and the set of all $\alpha$-$\beta$ pairs thus made available is a convex set in
the $\alpha$-$\beta$ square, symmetric about the point (1/2, 1/2), and including the
two extreme points (1, 0) and (0, 1), as illustrated in Figure 1.

The asserted convexity depends on the simplifying and innocuous
assumption that, whatever else you observe, you can also observe some
random real number that is totally irrelevant. Thus, if you have available
two decision rules that lead to $(\alpha', \beta')$ and $(\alpha'', \beta'')$, you can, with the
aid of your random number, employ the first rule with probability $p$ and
the second with probability $1 - p$. This itself constitutes a decision
rule, and it corresponds to the point with $\alpha = p\alpha' + (1 - p)\alpha''$ and
$\beta = p\beta' + (1 - p)\beta''$, which shows that the available set is convex. It is
easy indeed to see that (1, 0) and (0, 1) are available and that if
$(\alpha, \beta)$ is available, so is $(1 - \alpha, 1 - \beta)$, the point obtained by always
taking the opposite action. It can be shown that every convex set
satisfying these conditions can actually arise.

Figure 1.
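The two constructions just described, mixing two rules with an irrelevant random number and reversing every decision of a rule, can be sketched numerically. This is only an illustration: the function names `mix` and `reverse` and the error pair (1/10, 3/10) are hypothetical and do not come from the text.

```python
from fractions import Fraction

def mix(first, second, p):
    """Error pair of the randomized rule: follow `first` with probability p, else `second`."""
    (a1, b1), (a2, b2) = first, second
    return (p * a1 + (1 - p) * a2, p * b1 + (1 - p) * b2)

def reverse(pair):
    """Error pair of the rule that always takes the opposite action."""
    a, b = pair
    return (1 - a, 1 - b)

# The two trivial rules that ignore the datum.
ALWAYS_ACT_FOR_B = (Fraction(1), Fraction(0))  # always wrong under A, never wrong under B
ALWAYS_ACT_FOR_A = (Fraction(0), Fraction(1))  # never wrong under A, always wrong under B

rule = (Fraction(1, 10), Fraction(3, 10))      # hypothetical (alpha, beta) of some rule
blend = mix(rule, ALWAYS_ACT_FOR_A, Fraction(1, 2))
center = mix(rule, reverse(rule), Fraction(1, 2))

print(blend)    # a point on the segment joining the two rules
print(center)   # a rule and its reversal always average to (1/2, 1/2)
```

Exact fractions are used so that the midpoint of a rule and its reversal comes out as exactly (1/2, 1/2), which is the symmetry of the available set about that point.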

There is one evident principle about your choice among any set of
points made available to you in the $\alpha$-$\beta$ plane (by an experiment or
otherwise). Since $\alpha$ and $\beta$ are probabilities of errors, it is good to move
to the south and west and bad to move to the north and east. Formally,
if $(\alpha, \beta)$ is different from $(\alpha', \beta')$, and $\alpha \le \alpha'$ and $\beta \le \beta'$, then everyone
will prefer $(\alpha, \beta)$ to $(\alpha', \beta')$. This is the principle of admissibility.

Applied to the situation of Figure 1, the principle of admissibility
excludes all points in the convex set of available points except those on
the solid curve. These are well known to correspond to the
likelihood-ratio-test procedures, that is, procedures such that, for some
positive $r$ (the critical likelihood ratio of the test), you act
appropriately to A if
$\mathrm{Prob}(\mathrm{Datum} \mid B) / \mathrm{Prob}(\mathrm{Datum} \mid A) < r$ and appropriately to B if this
likelihood ratio is greater than $r$. The critical likelihood ratio of the test
pointed out by Figure 1 is the negative of the slope of the dashed tangent
line. For more details see Ref. [154], pp. 139, 212, and 213.
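To make the likelihood-ratio tests concrete, here is a minimal sketch for one assumed experiment: 13 rabbits are examined, and each responds with probability 0.3 if A holds and 0.6 if B holds. These two rates, and all the function names, are hypothetical choices for illustration only. The sketch lists the error pair of every likelihood-ratio test and checks that a rule acting against the evidence is dominated.

```python
from math import comb

N = 13                 # rabbits examined, echoing the example rule in the text
P_A, P_B = 0.3, 0.6    # hypothetical response rates under hypotheses A and B

def pmf(k, p):
    """Binomial probability of k responders among N."""
    return comb(N, k) * p**k * (1 - p)**(N - k)

def error_pair(act_for_b):
    """(alpha, beta) for a rule given as the set of counts k at which you act as if B held."""
    alpha = sum(pmf(k, P_A) for k in range(N + 1) if k in act_for_b)      # wrong when A holds
    beta = sum(pmf(k, P_B) for k in range(N + 1) if k not in act_for_b)   # wrong when B holds
    return alpha, beta

# The ratio P(k | B) / P(k | A) increases with k here, so every
# likelihood-ratio test has the form "act for B when k >= c".
lr_tests = {c: error_pair(set(range(c, N + 1))) for c in range(N + 2)}

def dominates(x, y):
    """True if x is at least as good as y in both errors and is a different point."""
    return x != y and x[0] <= y[0] and x[1] <= y[1]

perverse = error_pair(set(range(0, 4)))   # acts for B on LOW counts, against the evidence

for c, (alpha, beta) in lr_tests.items():
    print(f"c={c:2d}  alpha={alpha:.4f}  beta={beta:.4f}")
print("perverse rule dominated:", any(dominates(t, perverse) for t in lr_tests.values()))
```

As the cutoff c rises, alpha falls and beta rises, so no likelihood-ratio test dominates another; the randomized mixtures of neighboring tests fill in the rest of the admissible curve.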

This much is standard and generally agreed upon, and at this point the
frequentist pretty much tips his hat and says, "Well, I've done my duty
as a statistician. It's now up to you to choose from among the available
$\alpha$-$\beta$ points the admissible one you like best." This division of function
is  often  made  by  the  frequentist  school.  Though  not  really  compelled  to 
by  their  philosophy,  they  do  have  a  way  of  making  it  clear  that  they  are 
the  statistician  and  you  are  the  fellow  who  pays;  so  if  you  make  a  mis- 
take, it  is  not  exactly  the  statistician's  mistake.  The  separation  between 
general  theory  and  an  application  is,  of  course,  a  valid  one.  But  for  the 
philosophy  of  an  application  of  statistics,  it  seems  better  not  to  separate 
the  roles  of  statistician  and  client.  For  the  most  part  the  statistician 
should  be  thought  of,  and  should  think  of  himself,  as  identical  with  the 
client.  This  is  like  von  Neumann's  concept  [185,  186]  that  partners  in  a 
game  of  bridge  constitute  a  two-headed  schizophrenic.  Fortunately,  when 
a  statistician  helps  a  laboratory  colleague,  they  don't  even  need  to  be 
very  schizophrenic;  each  can  know  what  the  other  is  thinking  to  what- 
ever extent  he  is  able  to  tell  about  it. 

In  giving  no  counsel  beyond  the  principle  of  admissibility,  the  fre- 
quentists  are  much  less  objectivistic,  or  objectively  inclined,  than  we 
personalists  are.  See  what  confusion  can  arise  from  the  inclination  to 
call  personalists  "subjectivists"  and  frequentists  "objectivists".  There 
results  the  awkward  tautology  that  the  objectivists  are  more  subjective 
than  the  subjectivists. 

When  Dennis  Lindley  and  I  first  discussed  simple  dichotomy  it  seemed 
to  us  that  frequentists  must  have  been  making  the  discouraging  assump- 
tion, perhaps  not  always  consciously,  that  the  preference  of  a  person 
among $\alpha$-$\beta$ pairs could be a largely arbitrary one like the preferences of
a  person  (according  to  usual  economic  theory)  among  various  rations  of 
bread  and  wine.  This  was  only  a  surmise  about  possible  attitudes  until 
the  appearance  of  an  interesting  paper  in  which  Lehmann  [103]  takes 
just  these  steps  and  turns  back  in  understandable  despair. 

But  Lehmann,  and  presumably  other  frequentists,  missed  an  important 
trick;  these  indifference  curves  are  not  really  an  arbitrary  family  of 
curves.  Economists  have  often  had  their  fingers  burned  trying  to  make 
hypotheses about families of indifference curves, and it is now a
doctrine with them that just about anything can happen. With $\alpha$ and $\beta$,
however, this is not so.

Figure 2.

Let me show you one or two simple but important considerations.
Suppose you happen to be indifferent between the two points P and Q of
Figure 2. If you had to use the $\alpha$-$\beta$ pair P or the pair Q, you wouldn't
care which. Imagine now that I come to you and say, "Let me choose for
you." You say, "Why not? I'm indifferent." Then I say, "Do you know
what I am going to do? In choosing for you, I'm going to flip this coin.
I'll let the coin choose." Can you imagine saying, "Oh no! not that,
anything but that."? I think not. But to flip the (not necessarily fair)
coin produces out of P and Q the mixture point R. That is, to execute
P  or  Q  with  prescribed  probabilities  is  to  execute  a  procedure  that 
operates  like  (and,  in  effect,  is)  the  collinear  point  R.  Thus,  you  see 
that  all  the  points  between  two  indifferent  points  are  indifferent.  You 
can  pursue  the  argument  a  few  more  steps,  suggested  by  the  figure,  to 
show  that  the  family  of  curves  is  not  much  of  a  family  of  curves  at  all; 
it  is  just  a  family  of  parallel  straight  lines.  Your  seemingly  vast  initia- 
tive is  reduced  to  the  choice  of  a  single  number;  more  flexibility  than 
that  you  do  not  want.  This  conclusion  is,  I  think,  eminently  reasonable; 
it  has  often  been  tested  and  has  never  successfully  been  challenged. 
A  great  simplicity  has  been  revealed;  and  I  would  say  that  this  is,  in 
microcosm,  the  characteristic,  or  one  of  the  characteristic,  advances  of 
the  Bayesian  theory.  But  you  might  say,  as  did  a  most  capable  theoretical 
statistician,  "I  accept  and  welcome  the  simplification,  but  you  didn't 
use  any  prior  probabilities."  True,  I  did  not  overtly  use  the  words  or 
expressions  of  personal  probability  in  making  this  demonstration.  On  the 
contrary,  what  we  have  here  is  an  imperfect  and  special  demonstration, 
from  first  principles,  of  the  cogency  of  personal  probability.  For  if  now 
you imagine that in making your decisions about A and B something
tangible is at stake (for instance, that either error will cost $10.00 if you
make  it),  then  you  see  easily  that  having  parallel  straight  lines  for  your 
indifference  curves  amounts  to  no  more  or  less  than  having  certain  odds 
for one hypothesis against the other. Specifically, taking $\alpha$ as the
horizontal and $\beta$ as the vertical coordinate, the negative of your
critical slope is your prior, or initial, personal odds in favor of A against
B, when the two errors are equally expensive. Insofar as this
demonstration carries weight, it makes you a Bayesian, whether you thought you
wanted  to  be  or  not. 
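The link between parallel indifference lines and prior odds can be checked by brute force. The sketch below assumes equally costly errors, a hypothetical personal probability of 0.8 for A, and the same hypothetical 13-rabbit experiment (response rates 0.3 under A and 0.6 under B); none of these numbers come from the text. With equal costs, expected loss is proportional to P(A) times alpha plus P(B) times beta, so the indifference lines have slope -P(A)/P(B) when alpha is plotted horizontally.

```python
from math import comb

N = 13
P_A, P_B = 0.3, 0.6      # hypothetical response rates under A and B
PRIOR_A = 0.8            # hypothetical personal probability of A
PRIOR_B = 1 - PRIOR_A

def pmf(k, p):
    """Binomial probability of k responders among N."""
    return comb(N, k) * p**k * (1 - p)**(N - k)

LIK_A = [pmf(k, P_A) for k in range(N + 1)]
LIK_B = [pmf(k, P_B) for k in range(N + 1)]

def expected_error(act_for_b):
    """PRIOR_A * alpha + PRIOR_B * beta for the rule 'act for B when k is in act_for_b'."""
    alpha = sum(LIK_A[k] for k in range(N + 1) if k in act_for_b)
    beta = sum(LIK_B[k] for k in range(N + 1) if k not in act_for_b)
    return PRIOR_A * alpha + PRIOR_B * beta

# Likelihood-ratio rule whose critical ratio equals the prior odds of A against B.
critical_ratio = PRIOR_A / PRIOR_B
bayes_rule = frozenset(k for k in range(N + 1) if LIK_B[k] / LIK_A[k] > critical_ratio)

# Brute force over all 2**14 deterministic rules, for comparison.
all_rules = (frozenset(k for k in range(N + 1) if mask >> k & 1)
             for mask in range(2 ** (N + 1)))
best_rule = min(all_rules, key=expected_error)

print("likelihood-ratio rule acts for B at counts:", sorted(bayes_rule))
print("brute-force best rule acts for B at counts:", sorted(best_rule))
```

Under these assumptions the two rules coincide: choosing the point where an indifference line touches the admissible curve is the same as applying Bayes' theorem with those prior odds.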

The  demonstration,  as  I  say,  shows  quite  characteristically  what  Bayes 
has  to  offer,  because  before  this  there  were  either  no  rules  or  rules  that 
were  offered  halfheartedly  in  the  knowledge  that  they  were  rather  absurd 
if  taken  literally.  Thus,  the  Bayesian  position  can  be  viewed  as  a 
natural  completion,  an  overlooked  step  in  the  classical  theory. 

I  like  this  example  very  much  but  claim  no  credit  for  its  invention.  I 
first  met  it  in  the  company  of  Dennis  Lindley,  when  he  and  I  worked  it 
out  together,  stimulated  by  his  paper  [106].  But  it  is  in  the  statistical 
air  and  can  be  found  in  many  places  and  in  many  forms  [25,  112,  139]. 


STABLE  ESTIMATION 

I  shall  now  say  a  little  about  an  extraordinarily  successful  chapter  of 
Bayesian  Statistics,  one  that  will  give  you  some  idea  how  to  put  Bayesian 
ideas  to  practical  work. 

Let us, to take a concrete and simplified example, pretend that we
have a serious interest in the weight of this chair, and that we have an
excellent balance on which to weigh it, a balance (imported from
never-never land) known by us to deliver normally distributed measurements
with a standard deviation of 1 oz. Formally, we are going to look at an
instrument that produces a number D with a probability density
conditional on the actual weight W. In this particular case,
$p(D \mid W) = \varphi(D - W)$, where $\varphi$ is the standard normal probability density
$\varphi(x) = (2\pi)^{-1/2} \exp(-x^2/2)$. Bayes' theorem tells us that, after we have
weighed the chair,

$$p(W \mid D) = K(D)\, p(D \mid W)\, p(W). \tag{3}$$

The multiplier K(D) is determined by the condition that $p(W \mid D)$ is a
probability distribution in W for each possible observation D, such that

$$\int p(W \mid D)\, dW = K(D) \int p(D \mid W)\, p(W)\, dW = 1. \tag{4}$$

At this point, one is likely to lose heart and say, "I know $p(D \mid W)$;
that was postulated. But p(W) is anybody's guess." It is literally
anybody's guess, in that it describes the opinion of somebody, anybody.
Focus, though, on your own opinion. Still even your own opinion is
pretty foggy in your own mind, and you may therefore feel that it is
hopeless to get a usable output here. Actually, it is easy to do so if you
just put in what you really know about your p(W). The most useful thing
that you know about it is that p(W) changes slowly in W. To illustrate,
suppose  you  were  to  weigh  the  chair  on  a  very  reliable  printing  balance 
that  prints  out  the  weight  to  the  nearest  pound.  If  the  answer  printed  out 
were 27 lb, what odds would you then give that the exact weight is a
little below 27 lb and not above it? Hardly anyone would offer odds
perceptibly different from one to one. Of course, someone might. He
might happen to know that it is against the policy of the
plastic-chair-makers' union to make an armchair that weighs less than 27 lb 3 oz. Such
a  person  feels  almost  certain  that  the  chair  weighs  more  than  27  lb  3  oz; 
he  doesn't  share  our  opinions.  Again,  there  may  be  somebody  who 
weighed  the  chair  twenty  minutes  ago  on  an  ordinary  (not  a  printing) 
balance.  His  p(W)  may  not  have  the  broad  gentle  shape  that  leads  most 
of  us  to  offer  even  odds  after  seeing  the  printed  reading.  But,  if  you  will 
ask  yourself  truthfully,  you  will  see  that  you  are  content  with  even  odds, 
which  is  symptomatic  that  the  logarithmic  derivative  of  your  initial  p(W) 
is  small  when  measured  in  units  of  an  ounce,  especially  in  the  neighbor- 
hood of  weights  that  you  consider  to  be  at  all  reasonable,  say  from  15  to 
45  lb. 

What  p(W)  does  in  unreasonable  neighborhoods  is  hard  to  say  but  is 
less  important.  It  may  be  that  if  you  had  to  bet  given  that  the  chair 
weighs  100  lb  1  oz  or  100  lb  2  oz,  you  might  give  ten  to  one  odds  in 
favor  of  the  slightly  lower  possibility,  because  although  both  are  pre- 
posterous, the  lower  one  is  conceivably  much  less  preposterous. 

In  the  reasonable  region,  I  repeat,  the  logarithmic  derivative  of  your 
p(W)  is  small.  You  will  also  be  able  to  verify,  I  think,  by  a  clear  self- 
interrogation  that  p(W)  is,  at  no  W,  more  than,  say,  10,000  times  greater 
than  its  minimum  in  the  region  you  consider  reasonable. 

With  these  assumptions  about  the  behavior  of  your  p(W),  return  to  the 
study  of  the  balance  with  a  normally  distributed  error  of  standard  devi- 
ation 1  oz  and,  in  particular,  to  the  application  of  Equation  (3).  What 
must  be  computed  is  the  shape  of  the  product  of  the  two  functions  p(W) 
and $p(D \mid W)$ for some specific, reasonable value of D, say, 339.4 oz
(=  21  lb  3.4  oz).  I  say  shape  of  the  product,  because  it  is  enough  to 
know  the  product  up  to  a  multiplicative  constant,  since  ultimately  the 
answer  must  be  normalized  to  make  its  integral  equal  to  1.  The  problem 
is  shown  graphically  in  Figure  3.  The  units  appropriate  to  the  vertical 
scale  in  Figure  3  are  dimensionally  different  for  the  two  functions  shown, 
but this is of no consequence in computing the shape of the product.

Figure 3. [Plot of p(W) and $p(D \mid W) = \varphi(339.4 - W)$ against W; the
horizontal axis is marked at 339.4 and 340.4.]

The  figure  suggests,  and  it  is  not  hard  for  analysis  to  show,  that  under 
the  assumptions  we  have  now  made,  the  shape  of  the  product  is  almost 
uninfluenced  by  the  exact  shape  of  p(W),  and  is  almost  the  same  as  the 
shape of $p(D \mid W)$.

Thus, honest self-interrogation has led you to the conclusion that
$p(W \mid D) = c\, p(D \mid W)$, nearly. In this case, $p(D \mid W) = \varphi(D - W)$, so the
constant c must be 1. To illustrate, when you weigh a chair on a good
balance  with  a  standard  deviation  of  1  oz,  you  will  ordinarily  be  ready  to 
give  about  two  to  one  odds  that  the  true  weight  is  within  an  ounce  of 
what  the  balance  says  and  to  give  something  like  19  to  1  odds  that  the 
true  weight  is  within  2  oz  of  what  the  balance  says.  Three  ounces  gets 
a  bit  exotic,  in  practice,  but  maybe  you  would  give  odds  of  several 
hundred  to  one  in  that  case.  This  constitutes  a  practical,  full,  and 
satisfying  solution  of  the  statistical  problem. 
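The claim that a gentle prior barely affects the answer can be checked numerically. The sketch below assumes one particular gentle prior, a normal density centered at 480 oz (30 lb) with a standard deviation of 128 oz (8 lb); that prior, the grid, and the function names are hypothetical choices for illustration. It compares the posterior odds computed from Equation (3) on a grid with the odds given by the stable-estimation approximation $p(W \mid D) \approx \varphi(D - W)$.

```python
from math import erf, exp, pi, sqrt

D = 339.4                             # balance reading in ounces, as in the text
PRIOR_MEAN, PRIOR_SD = 480.0, 128.0   # hypothetical gentle prior, in ounces

def phi(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def prior(w):
    """Assumed prior density p(W); any slowly varying choice would serve."""
    return phi((w - PRIOR_MEAN) / PRIOR_SD) / PRIOR_SD

# Unnormalized posterior p(D | W) p(W) on a fine grid within 10 oz of the reading.
STEP = 0.001
grid = [D - 10 + i * STEP for i in range(20001)]
weights = [phi(D - w) * prior(w) for w in grid]
total = sum(weights)

def posterior_prob_within(h):
    """Posterior probability that W lies within h ounces of the reading."""
    return sum(wt for w, wt in zip(grid, weights) if abs(w - D) <= h) / total

def stable_prob_within(h):
    """The same probability under the approximation p(W | D) = phi(D - W)."""
    return erf(h / sqrt(2))

for h in (1, 2, 3):
    exact, approx = posterior_prob_within(h), stable_prob_within(h)
    print(f"within {h} oz: posterior odds {exact / (1 - exact):7.1f}, "
          f"stable-estimation odds {approx / (1 - approx):7.1f}")
```

The two columns should agree closely, and the same check can be repeated with any other prior that changes slowly over a few ounces.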

It  has,  incidentally,  been  a  forbidden  solution.  If  you  have  studied 
statistics  formally,  you  have  probably  been  thoroughly  taught  as  follows: 
A  2-oz  interval  around  the  reading  of  the  balance  is  a  two-sigma  con- 
fidence interval;  before  you  weigh  the  chair,  the  chances  are  19  to  1  that 
this  interval  will  hit  the  true  weight;  after  you  have  weighed  the  chair, 
you  cannot  even  discuss  the  probability  that  it  has  hit  the  true  weight. 
However,  according  to  Bayesian  statistics,  you  can,  when  p(W)  is  gentle, 
talk  rather  accurately  about  this  probability,  the  posterior  probability 
that  the  true  weight  is  in  the  interval;  it  is  19  to  1.  For  those  of  you  to 
whom  this  issue  is  new,  let  me  borrow  something  that  John  Pratt  recently 
said  [138].  Namely,  the  theoreticians  who  put  forward  confidence  in- 
tervals, self-forbidden  to  attach  probabilities  to  the  conclusions  of  their 
inferences,  invented  something  new  to  say,  but  could  not  invent  anything 
new  to  think.  People  who  actually  use  confidence  intervals  do  attach 
confidence  to  them,  even  when  they  avow  that  they  ought  not.  The  in- 
tervals we  are  now  talking  about  really  do  merit  confidence,  under  the 
circumstances envisaged. Since the name "confidence interval" is
preempted, let us call them credible intervals.

The method just illustrated, which can be called the method of stable
estimation,  is  easy  to  generalize  to  several  parameters  and  to  many 
kinds  of  problems.  There  are  also  useful  variations  of  it;  we  need  not 
always  assume  that  p(W)  itself  is  gentle,  and  it  is  sometimes  more  ap- 
propriate to  assume  that  p(W)  multiplied  by  some  function  of  W  is  gentle. 
All  the  textbook  estimation  problems  of  statistics  yield  to  the  method, 
the  hard  ones  as  well  as  the  easy  ones.  You  notice  that  in  the  chair- 
weighing  example  we  got  an  answer  that  is  somehow,  in  its  own  terms, 
the  same  as  the  usual  frequentist's  answer,  but  if  we  were  to  investigate 
binomial  and  Poisson  distributions,  we  would  get  slightly  different 
answers. 

The Behrens-Fisher problem is an instance in which the method of
stable estimation brings new clarification. This problem, at least in one
important form, is to make inferences about the difference of two unknown
parameters $\mu_1$ and $\mu_2$ on the basis of $n_1$ measurements normally
distributed about $\mu_1$ with unknown variance $\sigma_1^2$ and $n_2$ measurements about $\mu_2$
with unknown variance $\sigma_2^2$. Fisher has long given a certain answer to
this problem; see, for example, Ref. [65], pp. 92-99. However, few can
understand Fisher's logic in this connection even sufficiently to say
what his answer means. Jeffreys [85] arrives formally at the same
answer, in his much more understandable system, and his derivation can be
translated easily into one in the theory of stable estimation. Thus, what
Fisher calls the fiducial distribution of $\mu_1 - \mu_2$ in this problem can be
justified as an approximation to the posterior distribution of $\mu_1 - \mu_2$,
under widely applicable circumstances. The efforts of the
Neyman-Pearson school to deal with this problem (see, for example, Refs.
[160, 195, 196, 200]) have, with cause, been criticized by Fisher (for
example on pp. 96-99 of Ref. [65]). The idea of stable estimation is
anything but new, although it has never had the prominence it deserves.
See, for example, Ref. [184], Ref. [186], Section 3.4, and Ref. [72], p. 55.
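As a rough illustration of what such a posterior distribution looks like, the sketch below draws from the standard Behrens-Fisher form, in which each mean is, a posteriori, its sample mean plus its standard error times an independent Student t variate. This form rests on an assumption not spelled out in the text: that the prior is gentle in each mean and in the logarithm of each standard deviation. The data and function names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

def student_t(df, rng):
    """One draw from Student's t with df degrees of freedom."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2, 2.0)   # chi-square with df degrees of freedom
    return z / sqrt(chi2 / df)

def posterior_draws_of_difference(x, y, draws=20000, seed=0):
    """Monte Carlo draws of mu_1 - mu_2 under the assumed gentle priors."""
    rng = random.Random(seed)
    n1, n2 = len(x), len(y)
    m1, m2 = mean(x), mean(y)
    se1, se2 = stdev(x) / sqrt(n1), stdev(y) / sqrt(n2)
    return [(m1 + se1 * student_t(n1 - 1, rng)) - (m2 + se2 * student_t(n2 - 1, rng))
            for _ in range(draws)]

x = [10.1, 9.8, 10.4, 10.0, 9.9, 10.3]            # hypothetical measurements about mu_1
y = [9.2, 10.9, 8.7, 9.9, 10.6, 9.1, 9.5, 10.2]   # hypothetical measurements about mu_2

d = sorted(posterior_draws_of_difference(x, y))
lo, hi = d[int(0.025 * len(d))], d[int(0.975 * len(d))]
print(f"posterior median {d[len(d) // 2]:.3f}, 95% credible interval ({lo:.3f}, {hi:.3f})")
```

The interval printed is a credible interval in the sense used above: a statement about where the difference probably lies, given these data and these assumed priors.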

ILLUSTRATIVE  IMPLICATIONS 

Bayesian  statistics  is  rich  in  practical  implications.  Few  of  these  are 
wholly  new,  but  their  cumulative  and  systematic  effect  has  been  most 
illuminating.  The  Bayesian  theory  is  a  common  sense  theory.  At  any 
stage  in  it,  you  can  be  back  to  first  principles  in  five  minutes;  you  can 
always  see  and  test  each  claim  of  the  theory  quite  directly.  No  wonder, 
then,  that  many  of  the  things  it  has  to  say  have  been  seen  before  by 
experienced,  intuitive,  careful  people. 

One  of  the  most  remarkable  theses  in  conflict  with  the  Neyman-Pearson 
theory  was  first  put  forward  by  Barnard  [9]  and  by  Fisher  [64],  who  might 
not  like  to  be  called  personalists.  This  thesis  is  called  the  likelihood 
principle and I can best explain it here by means of an illustration.
Someone comes to a statistician and says, "I have inspected 831 of
these  vacuum  tubes,  and  17  of  them  are  defective.  What  can  I  conclude?" 
The  typical  frequentist  inquires,  "Did  you  plan  to  inspect  831  of  these? 
Or did you perhaps plan to inspect until you saw 17 defective tubes?"
The engineer asks, "What difference does it make?" "Oh," says the
frequentist, "It makes all the difference in the world. The probability of
drawing 17 defectives in a sample of 831 when the true frequency of
defectives is p is

$$\binom{831}{17} p^{17} (1 - p)^{814}. \tag{5}$$

But, if you had decided to look until you accumulated a collection of 17
defectives, the probability is

$$\binom{830}{16} p^{17} (1 - p)^{814}.$$

These two probabilities are not even approximately equal; the first
exceeds the second by a factor of 831/17, or almost 50." When the engineer
confesses,  "To  be  honest  with  you,  I  quit  when  I  got  tired,"  or  "I  quit 
when  the  boss  came  along  and  said,  'Enough  statistics  -  we've  got  to 
get  the  product  out',"  the  frequentist  statistician  wonders  what  to  do.  If 
only  he  had  been  consulted  before  the  sample  was  drawn!  Suppose  how- 
ever, the  engineer  says,  "It  was  like  this.  My  boss  told  me  he  didn't 
think  there  were  even  three  defectives  per  thousand  and  I  sampled  until  I 
had  enough  to  make  him  listen  to  me."  The  statistician  is  shocked  [56]. 
Perhaps  he  gives  way  to  indignation:  "You  cheater,  you  perverter  of 
data,  you  stopped  at  your  own  pleasure,  optionally,  when  you  thought  you 
were  ahead.  That's  just  like  erasing  figures  from  a  laboratory  notebook, 
or  worse." 

But  Barnard  and  Fisher  say  (and  we  Bayesians  say  with  them)  "The 
only thing that matters for the inference at hand is the factor $p^{17}(1 - p)^{814}$.
No  one  need  care,  now  that  the  data  are  in,  what  factor  precedes  it." 

The reason is quite obvious, from a Bayesian point of view. Namely,
what you think about p after you have seen the data is
$p(p \mid D) = K\, p(D \mid p)\, p(p) = K' p^{17}(1 - p)^{814}\, p(p)$. The constant $K'$ is determined by
normalization; it is not affected by any factor of $p(D \mid p)$ that does not
depend on p. The general principle is that the likelihood function of p for
the actual data D is all that counts in making an inference about p. The
likelihood function, as here defined, is $p(D \mid p)$ as a function of p modulo
(or abstracting from) factors that do not depend on p. Thus $p^3$ and $7p^3$
define the same likelihood function.
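A short numerical check of this point, assuming for illustration a flat prior on a grid (the prior, the grid, and the function names are hypothetical): the fixed-sample and the stop-at-17-defectives sampling probabilities differ by the constant factor 831/17, yet they lead to the same posterior distribution for p.

```python
from math import comb

N, X = 831, 17   # tubes inspected and defectives found, as in the text

def fixed_sample_probability(p):
    """Probability of 17 defectives when exactly 831 tubes are inspected."""
    return comb(N, X) * p**X * (1 - p)**(N - X)

def stop_at_17_probability(p):
    """Probability that the 17th defective turns up on the 831st tube."""
    return comb(N - 1, X - 1) * p**X * (1 - p)**(N - X)

def posterior(likelihood, prior, grid):
    """Normalized posterior weights on a grid of values of p."""
    w = [likelihood(p) * prior(p) for p in grid]
    total = sum(w)
    return [v / total for v in w]

def flat_prior(p):
    """Hypothetical prior; any other prior gives the same agreement."""
    return 1.0

grid = [(i + 0.5) / 2000 for i in range(2000)]
post_fixed = posterior(fixed_sample_probability, flat_prior, grid)
post_stop = posterior(stop_at_17_probability, flat_prior, grid)

ratio = fixed_sample_probability(0.02) / stop_at_17_probability(0.02)
largest_gap = max(abs(a - b) for a, b in zip(post_fixed, post_stop))
print("ratio of the two sampling probabilities:", ratio)
print("largest difference between the two posteriors:", largest_gap)
```

The constant factor is absorbed by normalization, which is why the stopping rule drops out of the inference.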

All  skirmishing  and  arguing  about  how  you  should  analyze  an  experi- 
ment if  it  was  done  sequentially,  and  how  you  should  analyze  it  if  it  was 
done by a fixed sample, and how great a sin it is to do optional stopping
is over for anyone who understands and believes the likelihood principle.
Quite  a  few  people  do;  and  no  one  has  presented  any  convincing  counter- 
examples. 

The principle is a surprise to traditional statistics and is in real
conflict with it. Traditionally, the engineer might set out to convince
his boss of something on the basis of a significance level, and it is
true that, with optional stopping, he would sooner or later exceed any
significance level. The moral is not that you should refrain from
optional stopping, nor even that you should analyze an experiment
involving optional stopping in some way different from that used to
analyze fixed samples or preannounced sequential samples. It is that
significance levels do not really signify.
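
The behavior of significance levels under optional stopping can be illustrated by simulation. The sketch below rests on assumptions not in the text: unit-variance normal observations with true mean 0 (so the null hypothesis is true), a two-sided 5 percent z criterion applied after every observation, and a cap on the sample size. The function names and the seed are illustrative.

```python
import random
from math import sqrt


def reaches_significance(rng, max_n, z_crit=1.96):
    # Draw N(0, 1) observations one at a time under a true null hypothesis
    # and stop the first time the running z statistic exceeds z_crit.
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total) / sqrt(n) > z_crit:
            return True
    return False


def false_alarm_rate(runs=300, max_n=500, seed=0):
    # Fraction of simulated experiments that reach "significance"
    # at some point before max_n observations have been taken.
    rng = random.Random(seed)
    hits = sum(reaches_significance(rng, max_n) for _ in range(runs))
    return hits / runs


rate = false_alarm_rate()
print(rate)
```

With a single look (`max_n=1`) the rate stays near the nominal 5 percent; with a look after every observation it is far larger and keeps growing as the cap is raised.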

The  likelihood  principle  makes  sequential  analysis  much  easier  to 
understand  and  appreciate,  and  it  makes  work  on  unbiased  estimates  for 
sequential  schemes  [71,  24]  irrelevant.  If  the  estimate  x/n  really  was 
good  enough  for  grandpa,  it  is  good  enough  for  us  too,  because  whether  it 
is good or not does not depend at all on whether the sampling was
sequential.

Naturally,  the  likelihood  principle  applies  not  only  to  the  binomial 
distribution, but to all kinds of sequential problems. It also bears on
certain actuarial estimations of a product of frequencies from random
amounts of data on each factor; the non-Bayesian literature on these is
led to by [55] and [90].

Long  ago  frequentists  posed  the  problem  of  a  program  that  would  lead 
to  confidence  intervals  of  prescribed  length,  say  one,  for  an  unknown 
mean $\mu$ based on normally distributed measurements of unknown variance
$\sigma^2$. Charles Stein [172] ingeniously adduced a method along the
following lines: First, take a small pilot sample of measurements. On the
basis of the pilot sample, take enough additional measurements so that
the unit interval centered around the mean of all the measurements will
be a confidence interval at the required confidence level. Technically,
the method works, especially with certain refinements incorporated by
Stein; but the confidence interval thus produced may not warrant
confidence. If, for example, the second sample suggests that $\sigma^2$
is much larger than was
indicated  by  the  first  sample,  no  one  will  be  much  inclined  to  trust  the 
unit  interval.  This  criticism  was  doubtless  made  early  by  frequentists; 
Stein  himself  may  well  have  been  one  of  the  first.  But  Bayesian  statistics 
has  a  straightforward  answer  to  the  problem  of  determining  a  unit  credible 
interval:  sample  until  the  central  credible  interval  associated  with  the 
accumulated  data  has  unit  length.  Of  course,  the  problem  itself  can  be 
criticized  as  economically  unreasonable.  Bayesian  theory  can  answer 
more  realistic  problems  too,  though  often  with  more  difficulty. 

The clarification of the Behrens-Fisher problem by Bayesian statistics
has already been mentioned. Bayesian statistics also clarifies a conflict
about a somewhat similar problem and its extensions, namely, to make
inferences about the ratio of unknown constants $\mu_1$ and $\mu_2$ from
a pair of measurements normally distributed about $\mu_1$ and $\mu_2$
with unit variance. It is helpful to think of this as the problem of
estimating the line through the origin determined by an unknown vector.
A natural confidence interval for this line is the set of all lines
through the origin that intersect a circle of fixed radius R about the
sample vector. This really is a confidence interval, since it includes
the true line if, and only if, the sample point falls within R units of
that line, which it will do with probability $1 - 2\Phi(-R)$, where
$\Phi$ is the standard normal ogive. The interval is not, however, a
credible interval, nor (contrary to Fieller [58]) a fiducial interval,
as is plain from the fact that it includes all directions, in case the
sample point falls within R units of the origin. Creasy [58] gives a
correct discussion of the fiducial intervals, which can also be seen to
be credible intervals.
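
The geometry of this confidence interval is simple enough to sketch in code. The following is a hedged illustration, not a procedure from the references: the function names are hypothetical, lines through the origin are described by an angle theta, and the coverage formula uses the identity that 1 minus twice the normal ogive at -R equals erf(R / sqrt 2).

```python
from math import cos, erf, hypot, sin, sqrt


def coverage(R):
    # Probability that the sample point falls within R units of the true
    # line: 1 - 2*Phi(-R), written here as erf(R / sqrt(2)).
    return erf(R / sqrt(2.0))


def line_in_interval(theta, x, y, R):
    # A line through the origin at angle theta intersects the circle of
    # radius R about the sample point (x, y) exactly when the point lies
    # within R units of the line.
    return abs(x * sin(theta) - y * cos(theta)) <= R


def includes_all_directions(x, y, R):
    # If the sample point is within R units of the origin, every line
    # through the origin meets the circle, so the interval is the whole
    # set of directions.
    return hypot(x, y) <= R


print(coverage(1.96))
print(includes_all_directions(0.5, 0.5, 1.96))
```

For a sample point close to the origin the interval excludes nothing, although the nominal confidence level is unchanged; that is the feature the text points to.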

The  Bayesian  theory  of  hypothesis  testing  (which  owes  much  to  Jeffreys 
[86,  87])  is  quite  different  from  any  frequency  theory  known  to  me.  To 
illustrate, no one is too ignorant of statistics to know that a $|t|$ of
3 with lots of degrees of freedom strongly disproves the null hypothesis.
I wish we didn't all know it, because it is false. Actually, in
circumstances not very unusual, a $|t|$ of 3 is mild evidence in favor of
the null hypothesis.
This  cannot  be  explained  fully  here,  but  the  idea  can  be  sketched. 

Suppose  an  accurate  melting-point  experiment  is  conducted  to  determine 
whether certain synthesized crystals really are statistic acid, as is
suspected. We then have some initial odds $I$ in favor of the null
hypothesis that the temperature discrepancy $\mu$ is 0 (or extremely near
0); given that the null hypothesis is false, our distribution of $\mu$
has a rather gentle density $p(\mu \mid F)$. After a normally distributed
measurement D of standard deviation $\sigma$, our odds in favor of the
null hypothesis are about

$$\frac{I\,\phi(D/\sigma)}{\sigma\,p(D \mid F)} \qquad (7)$$


if $D/\sigma = t$ is moderate, say less than 5. If the melting point
determination was undertaken with any realistic hope of very strongly
supporting the null hypothesis, then the coefficient of $I$ in Equation 7
would be large, say 100, in the most favorable possible case, $t = 0$.
But at $t = 3$, it will have this same value times $e^{-4.5} = 0.0111$,
so the approximate posterior odds will there be
$(100 \times 0.0111)\,I = 1.11\,I$, or slightly greater than $I$. The
frequentist, impressed by how improbable it was that $|t|$ should be as
large as 3 given the null hypothesis, is shocked at this conclusion; it is
against his principles to consider how improbable it was that $|t|$ should
be as small as 3 given the alternative hypothesis.
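
The arithmetic of this example is easily reproduced. The sketch below assumes, as the text does, that the density of D under the alternative is gentle enough to be treated as constant over moderate t, so that the coefficient of the initial odds varies with t only through the standard normal density. The function names are illustrative.

```python
from math import exp, pi, sqrt


def phi(t):
    # Standard normal density.
    return exp(-0.5 * t * t) / sqrt(2.0 * pi)


def odds_multiplier(t, multiplier_at_zero=100.0):
    # Coefficient of the initial odds at a given t, scaled so that it
    # equals multiplier_at_zero in the most favorable case t = 0.
    # Assumes p(D | F) is effectively constant over the range considered.
    return multiplier_at_zero * phi(t) / phi(0.0)


ratio_at_3 = phi(3.0) / phi(0.0)       # equals exp(-4.5)
multiplier_at_3 = odds_multiplier(3.0)
print(round(ratio_at_3, 4), round(multiplier_at_3, 2))
```

With the coefficient set to 100 at t = 0, the posterior odds at t = 3 come out at about 1.11 times the initial odds, in agreement with the figure quoted above.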


CONCLUSION 


Several  technical  references  have  been  made  but  it  may  be  useful  to 
suggest  some  systematic  reading  in  Bayesian  statistics.  Good's  book 
[72]  and  mine  [154]  contain  useful  preliminary  ideas  worked  out  in  some 
detail,  but  neither  is  really  in  the  current  spirit.  Schlaifer  has  written 
two excellent books, which are mathematically elementary but
intellectually mature: a long one [163], and a shorter one [164] that
emphasizes
the  statistical  parts  of  the  long  one.  Raiffa  and  Schlaifer  together  are  the 
authors  of  an  excellent  technical  book  on  the  subject  [139].  Ref.  [39] 
contains  various  interesting  contributions. 

Statistics  is  entering  a  period  of  swift  advance,  as  I  hope  to  have 
brought out. During such a period, no one is an expert. Anyone now
inclined to explore statistical theory will find himself on the ground
floor with many inviting stairways to himself. Those of you who use
statistics must now, to an unusual extent, rely on your own good sense
and understanding, not on rules, magic, or "powerful new tools."